Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland.
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad097.
A high-quality sequence assembly is the ultimate representation of an individual's complete genetic information. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated hundreds of gigabytes of assemblies on disk, which greatly impedes the distribution of and access to such rich datasets.
Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool, AGC (Assembled Genomes Compressor), compresses the genomes significantly better than existing programs and is much faster. Moreover, its unique features are the ability to access any contig (or part of one) in a fraction of a second and to easily append new samples to a compressed collection. As a result, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly falling cost and improving accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundational tool for storing, distributing and accessing pangenome data.
The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc.
Supplementary data are available at Bioinformatics online.