Insights from a genome-wide truth set of tandem repeat variation.

Authors

Weisburd Ben, Tiao Grace, Rehm Heidi L

Affiliations

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.

Publication information

bioRxiv. 2023 May 8:2023.05.05.539588. doi: 10.1101/2023.05.05.539588.

Abstract

Tools for genotyping tandem repeats (TRs) from short-read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold-standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcity of high-quality orthogonal truth data has limited our ability to measure tool accuracy for the millions of other loci throughout the genome. To address this, we developed a TR truth set based on the Synthetic Diploid Benchmark (SynDip). By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample. Our approach did not require running existing genotyping tools on short-read or long-read sequencing data and provided an alternative, more accurate view of tandem repeat variation. We applied this truth set to compare the strengths and weaknesses of widely used tools for genotyping TRs, evaluated the completeness of existing genome-wide TR catalogs, and explored the properties of tandem repeat variation throughout the genome. We found that, without filtering, ExpansionHunter had higher accuracy than GangSTR and HipSTR over a wide range of motifs and allele sizes. Also, when errors in allele size occurred, ExpansionHunter tended to overestimate expansion sizes, while GangSTR tended to underestimate them. Additionally, we saw that widely used TR catalogs miss between 16% and 41% of variant loci in the truth set. These results suggest that genome-wide analyses would benefit from genotyping a larger set of loci as well as further tool development that builds on the strengths of current algorithms. To that end, we developed a new catalog of 2.8 million loci that captures 95% of variant loci in the truth set, and created a modified version of ExpansionHunter that runs 2 to 3x faster than the original while producing the same output.
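
The filtering step described in the abstract (keeping only insertions and deletions that look like TR expansions or contractions with 2 to 50 bp motifs) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of only the right-hand reference flank, and the `min_total_copies` threshold are all assumptions made for illustration, and the sketch handles pure repeats only, not interrupted ones.

```python
from typing import Optional, Tuple

# Motif size bounds taken from the abstract (2 to 50 bp).
MIN_MOTIF_BP = 2
MAX_MOTIF_BP = 50


def count_leading_copies(seq: str, motif: str) -> int:
    """Count consecutive exact copies of motif at the start of seq."""
    count = 0
    while seq.startswith(motif, count * len(motif)):
        count += 1
    return count


def classify_indel(
    indel_seq: str,
    right_flank: str,
    min_total_copies: int = 3,  # hypothetical threshold, not from the source
) -> Optional[Tuple[str, int]]:
    """Return (motif, total_copies) if the indel looks like a pure TR change.

    indel_seq   -- the inserted or deleted bases
    right_flank -- reference bases immediately after the variant (assumed input)
    """
    seq = indel_seq.upper()
    flank = right_flank.upper()
    for motif_len in range(MIN_MOTIF_BP, min(MAX_MOTIF_BP, len(seq)) + 1):
        if len(seq) % motif_len != 0:
            continue  # indel must be a whole number of motif copies
        motif = seq[:motif_len]
        if len(set(motif)) == 1:
            continue  # homopolymer run, i.e. effectively a 1 bp motif
        copies_in_indel = len(seq) // motif_len
        if motif * copies_in_indel != seq:
            continue  # not a pure repeat of this motif
        total = copies_in_indel + count_leading_copies(flank, motif)
        if total >= min_total_copies:
            return motif, total
    return None


if __name__ == "__main__":
    print(classify_indel("CAGCAG", "CAGCAGCAGTTT"))
    print(classify_indel("ACGT", "GGGTTT"))
```

Trying motif lengths from shortest to longest means the smallest repeating unit is reported, so an indel of `ACACACAC` is classified with motif `AC` rather than `ACAC`. Requiring additional copies in the adjacent reference is one simple way to separate repeat expansions or contractions from arbitrary indels.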

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c636/10197592/284690ec7303/nihpp-2023.05.05.539588v1-f0001.jpg
