SeQual-Stream: approaching stream processing to quality control of NGS datasets.

Affiliation

Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain.

Publication information

BMC Bioinformatics. 2023 Oct 27;24(1):403. doi: 10.1186/s12859-023-05530-7.

Abstract

BACKGROUND

Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, needing to have the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in those scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for its parallel processing.

RESULTS

In this paper we present SeQual-Stream, a streaming tool that allows performing multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results have shown significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability features.
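The contrast between batch and stream processing can be illustrated with a minimal single-machine sketch. This is not SeQual-Stream's implementation, which the abstract says relies on Apache Spark and HDFS; it is plain Python that only shows the idea of filtering reads as they arrive rather than after the full dataset is available. It assumes four-line FASTQ records with Phred+33 quality encoding, and the names `FastqRecord`, `parse_fastq`, `mean_quality` and `quality_filter`, as well as the thresholds, are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One sequencing read (hypothetical container, not a SeQual-Stream type)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records one at a time from a line stream that may still be growing.

    Assumes the common four-line FASTQ layout: header, bases, '+', qualities.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\n")
            next(it)  # '+' separator line, ignored
            quality = next(it).rstrip("\n")
        except StopIteration:
            return  # trailing record is incomplete: stop without emitting it
        yield FastqRecord(header[1:], sequence, quality)


def mean_quality(record: FastqRecord, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    if not record.quality:
        return 0.0
    return sum(ord(c) - offset for c in record.quality) / len(record.quality)


def quality_filter(
    records: Iterable[FastqRecord],
    min_mean_quality: float = 20.0,  # illustrative threshold
    min_length: int = 1,             # illustrative threshold
    max_n: int = 0,                  # illustrative threshold
) -> Iterator[FastqRecord]:
    """Lazily keep reads that pass three simple quality control checks."""
    for record in records:
        if len(record.sequence) < min_length:
            continue
        if record.sequence.upper().count("N") > max_n:
            continue
        if mean_quality(record) < min_mean_quality:
            continue
        yield record
```

Because both functions are generators, `quality_filter(parse_fastq(stream))` emits each passing read as soon as its four lines have been received, so `stream` can be an open file or a network response that is still being written. A batch tool would instead wait for the complete file before examining the first read.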

CONCLUSION

Our solution provides a more scalable and higher performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.
