IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2157-2166. doi: 10.1109/TCBB.2019.2913932. Epub 2021 Dec 8.
The de Bruijn graph, a fundamental data structure to represent and organize genome sequence, plays important roles in various kinds of sequence analysis tasks. With the rapid development of HTS data and ever-increasing number of assembled genomes, there is a high demand to construct the very large de Bruijn graph for sequences up to Tera-base-pair level. Current approaches may have unaffordable memory footprints to handle such a large de Bruijn graph. We propose a lightweight parallel de Bruijn graph construction approach: de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. The experimental results demonstrate that, just with a commonly available machine, deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in GenBank database and Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM also has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large scale genomics studies. The deGSM is publicly available at: https://github.com/hitbc/deGSM.
de Bruijn 图是一种用于表示和组织基因组序列的基本数据结构,在各种序列分析任务中发挥着重要作用。随着高通量测序数据的快速发展和组装基因组数量的不断增加,构建高达太字节级别的序列的非常大的 de Bruijn 图的需求很高。当前的方法可能无法承受处理如此大的 de Bruijn 图所需的内存足迹。我们提出了一种轻量级的并行 de Bruijn 图构建方法:可扩展内存中的 de Bruijn 图构建器(deGSM)。deGSM 的主要思想是在恒定的 RAM 空间中有效地构建 de Bruijn 图的单路径的 Burrows-Wheeler 变换(BWT),并将 BWT 转换为原始单元。实验结果表明,仅使用常见的机器,deGSM 就能够处理非常大的基因组序列,例如 GenBank 数据库和云杉 HTS 数据集(9.7 Tbp)中记录的 contigs(305 Gbp)和 scaffolds(1.1 Tbp)。此外,与最先进的方法相比,deGSM 还具有更快或相当的构建速度。deGSM 具有高度的可扩展性和效率,在许多大规模基因组学研究中具有巨大的潜力。deGSM 可在以下网址获得:https://github.com/hitbc/deGSM。