Towards a better understanding of the low recall of insertion variants with short-read based variant callers.

Author information

Delage Wesley J, Thevenon Julien, Lemaitre Claire

Affiliations

Univ Rennes, Inria, CNRS, IRISA, Rennes, F-35000, France.

Inserm U1209, CNRS UMR 5309, Univ. Grenoble Alpes, Institute for Advanced Biosciences, Grenoble, France & Genetics, Genomics and Reproduction Service, Centre Hospitalo-Universitaire Grenoble-Alpes, Grenoble, France.

Publication information

BMC Genomics. 2020 Nov 4;21(1):762. doi: 10.1186/s12864-020-07125-5.

Abstract

BACKGROUND

Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions larger than 50 bp are among the hardest types to discover and are drastically underrepresented in gold-standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools.

RESULTS

In this work, we performed an in-depth analysis of these unprecedented insertion callsets to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature of the inserted sequence, its size, the genomic context of the insertion site, and the complexity of the breakpoint junction. Because these layers are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions, and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the factor with the greatest impact was the insertion type rather than the genomic context, and that the tested structural variant callers handled the various difficulties differently. They also highlighted the lack of sequence resolution for most insertion calls.
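As a rough illustration of how per-type recall figures such as those quoted above can be computed, the following minimal sketch counts, for each insertion type, the fraction of reference insertions that a caller recovered. The function name `recall_by_type`, the identifiers, and the toy data are hypothetical and invented for illustration; they are not taken from the study, and matching a call to a reference insertion is assumed here to be a simple identifier lookup rather than a breakpoint comparison.

```python
from collections import Counter


def recall_by_type(truth, detected_ids):
    """Return {insertion_type: recall} for a reference callset.

    truth        -- list of (insertion_id, insertion_type) pairs (reference set)
    detected_ids -- set of insertion ids that the caller recovered
    """
    # Number of reference insertions of each type.
    totals = Counter(ins_type for _, ins_type in truth)
    # Number of reference insertions of each type that were recovered.
    found = Counter(
        ins_type for ins_id, ins_type in truth if ins_id in detected_ids
    )
    # Recall = recovered / total, per type (Counter returns 0 for missing keys).
    return {ins_type: found[ins_type] / totals[ins_type] for ins_type in totals}


# Toy, made-up data: four tandem repeat expansions, two mobile element insertions.
truth = [
    ("ins1", "tandem_repeat"),
    ("ins2", "tandem_repeat"),
    ("ins3", "tandem_repeat"),
    ("ins4", "tandem_repeat"),
    ("ins5", "mobile_element"),
    ("ins6", "mobile_element"),
]
# "ins99" is a call with no reference match; it does not affect recall.
detected = {"ins1", "ins5", "ins6", "ins99"}

recalls = recall_by_type(truth, detected)
print(recalls)
```

Calls that match no reference insertion lower precision, not recall, so they are ignored in this sketch.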

CONCLUSIONS

Our results explain the low recall by pointing out several difficulty factors among the observed insertion features, and they provide avenues for improving structural variant caller algorithms and their combinations.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/73cd/7640490/66ab9384bbca/12864_2020_7125_Fig1_HTML.jpg
