BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.

Affiliation

CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela 15782, Spain.

Publication information

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad062. Epub 2023 Jul 31.

DOI:10.1093/gigascience/giad062

PMID:37522758

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10388699/

Abstract

BACKGROUND

High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. The literature offers several tools to process and manipulate these types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, which are likely to reach the order of terabytes in the coming years, because they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node.

RESULTS

Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit, with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to install and use on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line.
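
The general idea of parallelizing a sequence-file command can be illustrated with a minimal sketch. This is not BigSeqKit's API or implementation: the function names (parse_fasta, partition, partial_stats, fasta_stats) are hypothetical, and a local thread pool stands in for the distributed framework only to show the partition, map and reduce steps. A real deployment would spread the partitions across processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records: List[Record], n: int) -> List[List[Record]]:
    """Split records into n round-robin partitions (some may be empty)."""
    return [records[i::n] for i in range(n)]


def partial_stats(part: List[Record]) -> Tuple[int, int]:
    """Map step: (number of sequences, total bases) for one partition."""
    return len(part), sum(len(seq) for _, seq in part)


def fasta_stats(lines: Iterable[str], n_partitions: int = 4) -> Tuple[int, int]:
    """Compute (sequence count, total bases) by partition, map and reduce."""
    records = list(parse_fasta(lines))
    parts = partition(records, n_partitions)
    # Hypothetical stand-in for a distributed executor.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_stats, parts))
    # Reduce step: combine the per-partition results.
    return sum(c for c, _ in partials), sum(b for _, b in partials)


example = """>seq1 description
ACGTAC
GTA
>seq2
GGGCC
""".splitlines()

print(fasta_stats(example, n_partitions=2))  # (2, 14)
```

Because each partition is summarized independently and the partial results are merged by simple addition, the same pattern applies whether the partitions live in threads, processes or separate nodes.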

CONCLUSIONS

BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.

Similar articles

BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad062. Epub 2023 Jul 31.

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.

PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. eCollection 2016.

Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files.

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa368.

FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy.

BMC Bioinformatics. 2021 Mar 22;22(1):144. doi: 10.1186/s12859-021-04063-1.

SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files.

Bioengineering (Basel). 2021 May 7;8(5):59. doi: 10.3390/bioengineering8050059.

GTZ: a fast compression and cloud transmission tool optimized for FASTQ files.

BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):549. doi: 10.1186/s12859-017-1973-5.

RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms.

IEEE/ACM Trans Comput Biol Bioinform. 2023 May-Jun;20(3):2341-2348. doi: 10.1109/TCBB.2022.3219114. Epub 2023 Jun 5.

CIndex: compressed indexes for fast retrieval of FASTQ files.

Bioinformatics. 2022 Jan 3;38(2):335-343. doi: 10.1093/bioinformatics/btab655.

Cryfa: a secure encryption tool for genomic data.

Bioinformatics. 2019 Jan 1;35(1):146-148. doi: 10.1093/bioinformatics/bty645.

FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications.

Bioinformatics. 2017 May 15;33(10):1575-1577. doi: 10.1093/bioinformatics/btx010.

Cited by

Isolation, complete characterization and phylogeography of the first bacteriophage against , which encodes a pyruvate phosphate dikinase and represents a novel viral family.

Microb Genom. 2025 Apr;11(4). doi: 10.1099/mgen.0.001403.

Systematic Analysis of the Gene Family and Its Expression Profile Identifies Potential Key Candidate Genes Involved in Abiotic Stress Responses.

Plants (Basel). 2025 Mar 11;14(6):880. doi: 10.3390/plants14060880.

Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae055.

SeqKit2: A Swiss army knife for sequence and alignment processing.

Imeta. 2024 Apr 5;3(3):e191. doi: 10.1002/imt2.191. eCollection 2024 Jun.

References

Ensembl 2022.

Nucleic Acids Res. 2022 Jan 7;50(D1):D988-D995. doi: 10.1093/nar/gkab1049.

Twelve years of SAMtools and BCFtools.

Gigascience. 2021 Feb 16;10(2). doi: 10.1093/gigascience/giab008.

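Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files.

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa368.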

The International Genome Sample Resource (IGSR) collection of open human genomic variation resources.

Nucleic Acids Res. 2020 Jan 8;48(D1):D941-D947. doi: 10.1093/nar/gkz836.

Singularity: Scientific containers for mobility of compute.

PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017.

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.

PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. eCollection 2016.

fqtools: an efficient software suite for modern FASTQ file manipulation.

Bioinformatics. 2016 Jun 15;32(12):1883-4. doi: 10.1093/bioinformatics/btw088. Epub 2016 Feb 18.

HTSeq--a Python framework to work with high-throughput sequencing data.

Bioinformatics. 2015 Jan 15;31(2):166-9. doi: 10.1093/bioinformatics/btu638. Epub 2014 Sep 25.

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Nucleic Acids Res. 2010 Apr;38(6):1767-71. doi: 10.1093/nar/gkp1137. Epub 2009 Dec 16.

Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163. Epub 2009 Mar 20.
