Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. Using assemblies of model organisms and two bacterial pan-genomes as examples, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs substantially reduce both the cumulative sequence length and the number of sequences. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory usage as well as index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
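The abstract does not spell out how simplitigs are computed, so the following is only a minimal sketch of one greedy way to build them, not a description of ProphAsm itself. It assumes canonical k-mers (a k-mer and its reverse complement are treated as one), an alphabet of A, C, G, T, and k of at least 2. All function names (`kmer_set`, `extend_forward`, `greedy_simplitigs`) are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: a greedy simplitig construction over canonical k-mers.
# Function names are hypothetical; this is not the ProphAsm implementation.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(_COMPLEMENT)


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_set(sequences, k):
    """Collect the set of canonical k-mers occurring in the given sequences."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(canonical(seq[i:i + k]))
    return kmers


def extend_forward(simplitig, kmers, k):
    """Append bases while some unused k-mer overlaps the last k-1 bases.

    Every k-mer that gets used is removed from `kmers` (modified in place).
    """
    while True:
        suffix = simplitig[-(k - 1):]
        for base in "ACGT":
            candidate = canonical(suffix + base)
            if candidate in kmers:
                kmers.remove(candidate)
                simplitig += base
                break
        else:
            # No unused k-mer continues the sequence in this direction.
            return simplitig


def greedy_simplitigs(kmers, k):
    """Greedily cover a canonical k-mer set with sequences that use each k-mer once."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    remaining = set(kmers)  # work on a copy so the caller's set is untouched
    simplitigs = []
    while remaining:
        seed = remaining.pop()  # arbitrary unused k-mer
        seq = extend_forward(seed, remaining, k)
        # Extending the reverse complement forward equals extending backward.
        seq = extend_forward(reverse_complement(seq), remaining, k)
        simplitigs.append(seq)
    return simplitigs


if __name__ == "__main__":
    k = 4
    kmers = kmer_set(["ACGTACGGTCA", "GGTCATT"], k)
    for s in greedy_simplitigs(kmers, k):
        print(s)
```

Because the seed k-mer is picked arbitrarily, the exact sequences produced can differ between runs. What stays fixed in this sketch is that the output contains every input k-mer exactly once, which is the property that lets such sequences shrink the cumulative length relative to listing k-mers individually.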