SparkEC: speeding up alignment-based DNA error correction tools.

Affiliation

Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain.

Publication information

BMC Bioinformatics. 2022 Nov 7;23(1):464. doi: 10.1186/s12859-022-05013-1.

Abstract

BACKGROUND

In recent years, Next Generation Sequencing (NGS) has brought huge improvements to the sequencing of genomic data. However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to process the large datasets that NGS now generates. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance.

RESULTS

In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been shown to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant reductions in the computational times of SparkEC compared to CloudEC for all the representative datasets and scenarios under evaluation, with average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart.
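As a reading aid for the speedup figures, the following minimal Python sketch shows one common way such numbers are derived: each speedup is the baseline runtime divided by the optimized runtime, and the average and maximum are then taken over the per-dataset ratios. This is an assumption about the aggregation, not a description of the authors' evaluation scripts. The function names and the timings are hypothetical and are not measurements from the paper.

```python
from statistics import mean


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / optimized_seconds


def summarize(runs: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Return (average speedup, maximum speedup) over all datasets.

    `runs` maps a dataset name to (baseline_seconds, optimized_seconds).
    The average here is the arithmetic mean of the per-dataset ratios,
    which is an assumption; other aggregations are possible.
    """
    ratios = [speedup(base, opt) for base, opt in runs.values()]
    return mean(ratios), max(ratios)


# Hypothetical timings for illustration only (not data from the paper).
example_runs = {
    "dataset_a": (1000.0, 250.0),
    "dataset_b": (900.0, 100.0),
}
average_speedup, maximum_speedup = summarize(example_runs)
print(f"average: {average_speedup:.1f}x, maximum: {maximum_speedup:.1f}x")
```

Averaging the per-dataset ratios weights every dataset equally; dividing total baseline time by total optimized time would instead weight larger datasets more heavily, so the two summaries can differ.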

CONCLUSION

As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Thanks to its distributed implementation, SparkEC can run faster as the number of nodes in a cluster increases. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7dba/9639292/92f0ac653454/12859_2022_5013_Fig1_HTML.jpg