Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files.

Affiliations

Institute for Advanced Study, Chengdu University, Chengdu, China.

College of Life Sciences and Food Engineering, Yibin University, Yibin, China.

Publication information

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa368.

DOI:10.1093/bib/bbaa368

PMID:33341884

Abstract

FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data deposited and accessed in FASTA/Q formats is increasing dramatically. However, existing tools are very inefficient at random retrieval of subsequences because they must load the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, these tools do not support random access to sequences in gzip-compressed FASTA/Q files, even though gzip is widely adopted by most public databases to save storage. In this study, we developed pyfastx, a versatile Python package with commonly used command-line tools, to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in index building and random access to sequences, particularly when dealing with large FASTA/Q files containing hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip-compressed FASTA/Q files without decompressing them beforehand. Pyfastx can be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
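
To make the indexing idea concrete, the following is a minimal sketch of index-based random access to a plain (uncompressed) FASTA file using only the Python standard library. It is not pyfastx's implementation: the table layout, the function names (`build_index`, `fetch`, `demo`) and the choice of SQLite for an on-disk index are assumptions made for illustration. The sketch also assumes well-formed input in which every sequence line of a record, except possibly the last, has the same width, and it does not handle gzip-compressed files.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, index_path):
    """Scan the FASTA file once and write one index row per record to SQLite."""
    db = sqlite3.connect(index_path)
    db.execute("DROP TABLE IF EXISTS seq")
    db.execute(
        "CREATE TABLE seq (name TEXT PRIMARY KEY, offset INTEGER, "
        "length INTEGER, line_bases INTEGER, line_bytes INTEGER)"
    )
    insert = "INSERT INTO seq VALUES (?, ?, ?, ?, ?)"
    record = None  # [name, byte offset of first base, length, bases/line, bytes/line]
    with open(fasta_path, "rb") as fh:
        for line in iter(fh.readline, b""):
            if line.startswith(b">"):
                if record:
                    db.execute(insert, record)  # flush the previous record
                name = line[1:].split()[0].decode()
                record = [name, fh.tell(), 0, 0, 0]
            elif record:
                bases = len(line.rstrip(b"\r\n"))
                if record[3] == 0:  # first sequence line defines the line width
                    record[3], record[4] = bases, len(line)
                record[2] += bases
    if record:
        db.execute(insert, record)
    db.commit()
    db.close()


def fetch(fasta_path, index_path, name, start, end):
    """Return bases [start, end) (0-based) of one record without reading the rest."""
    db = sqlite3.connect(index_path)
    row = db.execute(
        "SELECT offset, length, line_bases, line_bytes FROM seq WHERE name = ?",
        (name,),
    ).fetchone()  # only this one index row is loaded into memory
    db.close()
    if row is None:
        raise KeyError(name)
    offset, length, line_bases, line_bytes = row
    start, end = max(0, start), min(end, length)
    if start >= end:
        return ""

    def byte_pos(i):
        # Map a base coordinate to its byte position in the file.
        return offset + (i // line_bases) * line_bytes + i % line_bases

    with open(fasta_path, "rb") as fh:
        fh.seek(byte_pos(start))
        raw = fh.read(byte_pos(end - 1) + 1 - byte_pos(start))
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


def demo():
    """Write a tiny two-record FASTA file and fetch two slices from it."""
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "demo.fa")
        with open(fasta, "w", newline="\n") as fh:
            fh.write(">seq1 first record\nACGTACGTAC\nGGGTTTAAAC\nCC\n")
            fh.write(">seq2\nTTTTGGGGCC\nAA\n")
        index = fasta + ".idx.sqlite"
        build_index(fasta, index)
        return (
            fetch(fasta, index, "seq1", 8, 13),
            fetch(fasta, index, "seq2", 0, 12),
        )
```

Calling `demo()` returns a slice that spans a line break in the first record and the full second record. Because the index lives on disk, a lookup reads a single row rather than loading every entry into memory, which is the limitation the abstract attributes to existing tools.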
