BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.

Affiliation

CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela 15782, Spain.

Publication information

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad062. Epub 2023 Jul 31.

Abstract

BACKGROUND

High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. The literature offers several tools to process and manipulate these types of files with the aim of transforming sequence data into biological knowledge. However, because they are based on sequential processing, none of them is well suited to efficiently processing very large files, which are likely to reach the order of terabytes in the coming years. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node.

RESULTS

Our approach, BigSeqKit, takes advantage of a combined high-performance computing and Big Data framework to parallelize and optimize the commands included in seqkit, with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to install and use on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line.

CONCLUSIONS

BigSeqKit is a comprehensive and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
