Comparison and benchmark of structural variants detected from long read and long-read assembly.

Author information

MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.

School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.

Publication information

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad188.

Abstract

Structural variant (SV) detection is essential for genomic studies, and long-read sequencing technologies have advanced our capacity to detect SVs directly from reads or from de novo assemblies, known as the read-based and assembly-based strategies, respectively. However, to date, no independent study has compared and benchmarked the two strategies. Here, on the basis of SVs detected by 20 read-based and eight assembly-based detection pipelines from six datasets of the HG002 genome, we investigated the factors that influence the two strategies and assessed their performance with well-curated SVs. We found that up to 80% of the SVs could be detected by both strategies across different long-read datasets, whereas the variant type, size, and breakpoint detected by the read-based strategy were greatly affected by the aligner. For the high-confidence insertions and deletions in non-tandem repeat regions, a substantial subset (82% in assembly-based calls and 93% in read-based calls), accounting for around 4000 SVs, could be captured by both reads and assemblies. However, discordance between the two strategies was largely caused by complex SVs and inversions, which resulted from inconsistent alignment of reads and assemblies at these loci. Finally, when benchmarked with SVs at medically relevant genes, the recall of the read-based strategy reached 77% on 5X coverage data, whereas the assembly-based strategy required 20X coverage data to achieve similar performance. Therefore, integrating SVs from reads and assemblies is suggested for general-purpose detection because complex SVs and inversions are detected inconsistently, whereas the assembly-based strategy is optional for applications with limited resources.
