Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
Google Inc., 8002 Zürich, Switzerland.
Bioinformatics. 2022 Jan 12;38(3):604-611. doi: 10.1093/bioinformatics/btab749.
With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes.
We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets.
The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2.
Supplementary data are available at Bioinformatics online.