Department of Electrical and Computer Engineering.
TECNUN School of Engineering, University of Navarra, Donostia 20018, Gipuzkoa, Spain.
Bioinformatics. 2020 Sep 15;36(18):4810-4812. doi: 10.1093/bioinformatics/btaa604.
Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, the gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, such as gffutils and gffread, are not designed to reduce storage space and in some cases increase it significantly. We propose GPress, a framework for querying GFF files in compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general-purpose lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval.
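The block-plus-index idea can be illustrated with a minimal sketch. This is not the GPress implementation: it assumes zlib as a stand-in for BSC, a fixed number of lines per block, and a single hypothetical index table keyed on the raw attributes column. The function names `compress_blocks` and `query` are illustrative only.

```python
import zlib

ATTR_COL = 8  # attributes are the 9th tab-separated GFF column


def compress_blocks(lines, block_size=2):
    """Compress GFF lines in fixed-size blocks and build an ID -> blocks index.

    Assumption: zlib stands in for BSC, and the whole attributes field
    is used as the record identifier.
    """
    blocks, index = [], {}
    for start in range(0, len(lines), block_size):
        chunk = lines[start:start + block_size]
        block_no = len(blocks)
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
        for line in chunk:
            record_id = line.split("\t")[ATTR_COL]
            index.setdefault(record_id, set()).add(block_no)
    return blocks, index


def query(blocks, index, record_id):
    """Decompress only the blocks that the index lists for record_id."""
    hits = []
    for block_no in sorted(index.get(record_id, ())):
        text = zlib.decompress(blocks[block_no]).decode("utf-8")
        for line in text.split("\n"):
            if line.split("\t")[ATTR_COL] == record_id:
                hits.append(line)
    return hits


# Toy annotation records (made up for illustration).
gff_lines = [
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=g1",
    "chr1\tsrc\tgene\t900\t1200\t.\t-\t.\tID=g2",
    "chr1\tsrc\texon\t900\t1000\t.\t-\t.\tID=g2",
    "chr1\tsrc\texon\t300\t500\t.\t+\t.\tID=g1",
]
blocks, index = compress_blocks(gff_lines, block_size=2)
g1_records = query(blocks, index, "ID=g1")
```

In this sketch a query for `ID=g1` touches only the first and third blocks, so the block holding the `ID=g2` records is never decompressed. Smaller blocks make retrieval more selective but typically compress less well, which is the trade-off a block-based design has to balance.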
We tested GPress on several GFF files from different organisms and showed that it achieves, on average, a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files), while retrieving all annotations for a given identifier or range of coordinates in a few seconds when run on a common laptop. In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When an expression file is additionally linked, GPress reduces its size by more than 68% compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress, rather than to the original file, yields a size reduction of more than 44% on average.
GPress is freely available at https://github.com/qm2/gpress.
Supplementary data are available at Bioinformatics online.