IEEE/ACM Trans Comput Biol Bioinform. 2022 Jul-Aug;19(4):2471-2483. doi: 10.1109/TCBB.2021.3062068. Epub 2022 Aug 8.
Recently, the compacted de Bruijn graph (cDBG) of complete genome sequences has been used successfully in read mapping because it handles repeats in genomes well. However, current approaches are not flexible enough when the graph must be rebuilt frequently with different k-mer lengths. Instead of building the graph directly each time, can the compacted de Bruijn graph for a longer k-mer be built from the graph for a shorter k-mer? In this article, we present StLiter, a novel algorithm that builds the compacted de Bruijn graph either directly from genome sequences or indirectly from the graph of a shorter k-mer. For 100 simulated human genomes, StLiter constructs the graph for k-mer lengths 15-18 in 2.5-3.2 hours with a peak memory of about 70 GB when the reverse complements of the reference genomes are not considered, and in 4.5-5.9 hours when they are. In experiments, we compared StLiter with TwoPaCo, the state-of-the-art method for building the graph, on 4 datasets. For k-mer lengths 15-18, StLiter builds the graph 5-9 times faster than TwoPaCo with lower peak memory. For k-mer lengths larger than 18, given the graph of a shorter (k-x)-mer, such as x = 1-2, StLiter also builds the graph more efficiently than TwoPaCo building it directly. The source code of StLiter can be downloaded from https://github.com/BioLab-cz/StLiter.
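For readers unfamiliar with the data structure, the following is a minimal Python sketch of what a compacted de Bruijn graph is. It is not the StLiter algorithm and not TwoPaCo's. It assumes a naive in-memory construction in which vertices are k-mers, edges join consecutive k-mers, and a vertex is a junction when its in-degree or out-degree differs from 1. Reverse complements are ignored, and an isolated k-mer with no edges produces no path. The function names `build_dbg` and `compacted_paths` are hypothetical, chosen only for illustration.

```python
def build_dbg(seqs, k):
    """Return successor and predecessor maps over all k-mers of seqs."""
    succ, pred = {}, {}
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            succ.setdefault(kmer, set())
            pred.setdefault(kmer, set())
            if i > 0:
                # Consecutive k-mers in a sequence are joined by an edge.
                prev = s[i - 1:i - 1 + k]
                succ[prev].add(kmer)
                pred[kmer].add(prev)
    return succ, pred


def compacted_paths(seqs, k):
    """Spell every maximal non-branching path of the k-mer graph."""
    succ, pred = build_dbg(seqs, k)

    def is_junction(v):
        # Assumption: a junction is any k-mer whose in- or out-degree is not 1.
        return len(pred[v]) != 1 or len(succ[v]) != 1

    paths, seen = [], set()
    for v in succ:
        if not is_junction(v):
            continue
        seen.add(v)
        for w in succ[v]:
            label = v
            while True:
                label += w[-1]          # extend the path by one base
                seen.add(w)
                if is_junction(w):
                    break
                (w,) = succ[w]          # inner vertex: exactly one successor
            paths.append(label)

    # Vertices not reached above lie on cycles that contain no junction.
    for v in succ:
        if v in seen:
            continue
        label, w = v, v
        while True:
            seen.add(w)
            (w,) = succ[w]
            label += w[-1]
            if w == v:
                break
        paths.append(label)
    return sorted(paths)


print(compacted_paths(["ACGTACGA"], 3))  # ['ACGA', 'ACGTACG']
print(compacted_paths(["ACGTACGA"], 4))  # ['ACGTACGA']
```

In this toy input the 3-mer ACG occurs twice and becomes a branching junction, while with k = 4 the repeat is resolved and the whole sequence collapses into a single path. This shows why the graph changes with k, and why one may want to rebuild it for several k-mer lengths.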