Department of Computer Science, Vanderbilt University, 37235, Nashville, TN, USA.
Department of Biomedical Engineering, Vanderbilt University, 37235, Nashville, TN, USA.
Nat Commun. 2024 Mar 19;15(1):2447. doi: 10.1038/s41467-024-46614-z.
Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Alignment-based methods, favored for their computational efficiency and lower coverage requirements, are prominent. Alternative approaches, relying solely on available reads for de novo genome assembly and employing assembly-based tools for SV detection via comparison to a reference genome, demand significantly more computational resources. However, the lack of comprehensive benchmarking constrains our comprehension and hampers further algorithm development. Here we systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers. Assembly-based tools excel in detecting large SVs, especially insertions, and exhibit robustness to evaluation parameter changes and coverage fluctuations. Conversely, alignment-based tools demonstrate superior genotyping accuracy at low sequencing coverage (5-10×) and excel in detecting complex SVs, like translocations, inversions, and duplications. Our evaluation provides performance insights, highlighting the absence of a universally superior tool. We furnish guidelines across 31 criteria combinations, aiding users in selecting the most suitable tools for diverse scenarios and offering directions for further method development.