Department of Computer Science and Engineering, University of California, San Diego, San Diego CA, USA.
Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego CA, USA.
Nat Biotechnol. 2022 Jul;40(7):1075-1081. doi: 10.1038/s41587-022-01220-6. Epub 2022 Feb 28.
Although most existing genome assemblers are based on de Bruijn graphs, the construction of these graphs for large genomes and large k-mer sizes has remained elusive. This algorithmic challenge has become particularly pressing with the emergence of long, high-fidelity (HiFi) reads that have been recently used to generate a semi-manual telomere-to-telomere assembly of the human genome. To enable automated assemblies of long, HiFi reads, we present the La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads by three orders of magnitude, constructs the de Bruijn graph for large genomes and large k-mer sizes and transforms it into a multiplex de Bruijn graph with varying k-mer sizes. Compared to state-of-the-art assemblers, our algorithm not only achieves five-fold fewer misassemblies but also generates more contiguous assemblies. We demonstrate the utility of LJA via the automated assembly of a human genome that completely assembled six chromosomes.
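As a rough illustration of the de Bruijn graph the abstract refers to, the sketch below builds a plain, uncompressed graph from a handful of reads: nodes are (k-1)-mers and each k-mer contributes one edge. This is not LJA's implementation. It omits the Bloom filter, sparse graph construction, disjointig generation, error correction and the multiplex transformation, and the function name `build_de_bruijn` and the toy reads are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read; each k-mer is one edge
        # from its (k-1)-length prefix to its (k-1)-length suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (hypothetical data, error-free for simplicity).
reads = ["ACGTAC", "GTACGG"]

graph_k4 = build_de_bruijn(reads, k=4)
graph_k5 = build_de_bruijn(reads, k=5)

for node in sorted(graph_k4):
    print(node, "->", sorted(graph_k4[node]))
```

With `k=4` the node `ACG` has two successors (`CGT` and `CGG`), so the graph branches; with `k=5` every node in this toy example has a single successor. That is the intuition behind preferring large k-mer sizes, although a naive dictionary of this kind does not scale to large genomes, which is the construction problem the abstract describes.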