SparkEC: speeding up alignment-based DNA error correction tools.

Affiliation

Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain.

Publication information

BMC Bioinformatics. 2022 Nov 7;23(1):464. doi: 10.1186/s12859-022-05013-1.

DOI:10.1186/s12859-022-05013-1

PMID:36344928

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9639292/

Abstract

BACKGROUND

In recent years, Next Generation Sequencing (NGS) has brought huge improvements to the sequencing of genomic data. However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets that NGS generates nowadays. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance.

RESULTS

In this paper, we present SparkEC, a parallel tool that corrects the errors produced during the sequencing process. To this end, the algorithms proposed by the CloudEC tool, which has already been shown to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant reductions in the computational time of SparkEC compared to CloudEC for all the representative datasets and scenarios under evaluation, with average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart.
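As a reading aid, here is a minimal sketch of how average and maximum speedup figures of this kind are commonly derived, assuming speedup is defined as baseline runtime divided by optimized runtime and that the average is an arithmetic mean over runs. The abstract does not state either definition. The function name `speedups` and all runtime values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def speedups(baseline_secs, optimized_secs):
    """Return per-run speedups: baseline runtime divided by optimized runtime."""
    if len(baseline_secs) != len(optimized_secs):
        raise ValueError("runtime lists must have the same length")
    return [b / o for b, o in zip(baseline_secs, optimized_secs)]


# Hypothetical runtimes in seconds, one pair per dataset/scenario.
# These are illustrative values only, NOT measurements from the paper.
baseline = [1200.0, 3000.0, 900.0]   # e.g. the reference tool
optimized = [300.0, 500.0, 450.0]    # e.g. the optimized tool

ratios = speedups(baseline, optimized)
print(f"average speedup: {mean(ratios):.1f}x")  # average speedup: 4.0x
print(f"maximum speedup: {max(ratios):.1f}x")   # maximum speedup: 6.0x
```

With a definition like this, the average summarizes all evaluated scenarios while the maximum reflects the single most favorable one, which is why the two figures can differ widely.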

CONCLUSION

As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Thanks to its distributed implementation, SparkEC's speed can increase with the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).



