Estimating sequencing error rates using families.

Author information

Paskov Kelley, Jung Jae-Yoon, Chrisman Brianna, Stockham Nate T, Washington Peter, Varma Maya, Sun Min Woo, Wall Dennis P

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Department of Pediatrics (Systems Medicine), Stanford University, Stanford, CA, USA.

Publication information

BioData Min. 2021 Apr 23;14(1):27. doi: 10.1186/s13040-021-00259-6.

Abstract

BACKGROUND

As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates because they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.
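To make the idea of a Mendelian error concrete, here is a minimal sketch in Python. It is not the authors' estimator. It assumes biallelic autosomal sites, genotypes coded as the number of alternate alleles (0, 1, or 2), and complete mother-father-child trios; the function names and the example calls are hypothetical.

```python
from itertools import product

# Alleles (0 = reference, 1 = alternate) a parent can transmit,
# keyed by that parent's genotype coded as an alternate-allele count.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_consistent(mother: int, father: int, child: int) -> bool:
    """Return True if the child's genotype can arise from the parents' genotypes."""
    return any(
        m + f == child
        for m, f in product(TRANSMITTABLE[mother], TRANSMITTABLE[father])
    )


def count_mendelian_errors(trio_genotypes) -> int:
    """Count sites whose (mother, father, child) genotypes violate Mendelian inheritance."""
    return sum(
        not is_mendelian_consistent(m, f, c) for m, f, c in trio_genotypes
    )


# Hypothetical calls at five sites, as (mother, father, child).
calls = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1), (0, 2, 1)]
n_errors = count_mendelian_errors(calls)  # sites 2 and 4 are inconsistent
```

Only some miscalls surface this way. If the true genotypes are (0, 0, 0) and the mother is miscalled as heterozygous, the observed trio (1, 0, 0) is still Mendelian-consistent, which is why only a fraction of sequencing errors is observable as Mendelian errors.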

RESULTS

We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that (1) sequencing error rates among samples in the same dataset can vary by over an order of magnitude; (2) variant calling performance decreases substantially in low-complexity regions of the genome; (3) variant calling performance in whole exome sequencing data decreases with distance from the nearest target region; (4) variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood; and (5) whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
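The twin-based validation rests on a simple observation: monozygotic twins should carry essentially the same germline genotypes, so a site where their calls differ indicates an error in at least one sample. The sketch below computes that naive discordance rate in Python. It is an illustration only, not the authors' validation procedure. It assumes each sample's calls are held in a dictionary keyed by site, that true differences between the twins are negligible, and that sites where both twins carry the same wrong call go undetected; the function name and the example data are hypothetical.

```python
def twin_discordance(calls_a: dict, calls_b: dict) -> float:
    """Fraction of sites called in both twins where the two genotypes differ."""
    shared_sites = calls_a.keys() & calls_b.keys()
    if not shared_sites:
        raise ValueError("no sites are called in both samples")
    mismatches = sum(calls_a[site] != calls_b[site] for site in shared_sites)
    return mismatches / len(shared_sites)


# Hypothetical genotype calls keyed by (chromosome, position),
# coded as alternate-allele counts.
twin_1 = {("chr1", 100): 0, ("chr1", 200): 1, ("chr1", 300): 2, ("chr2", 50): 1}
twin_2 = {("chr1", 100): 0, ("chr1", 200): 1, ("chr1", 300): 1, ("chr2", 50): 1,
          ("chr2", 90): 0}  # this site is called only in twin 2 and is ignored

rate = twin_discordance(twin_1, twin_2)  # one mismatch among four shared sites
```

A discordance rate of this kind gives an independent check on error rates, because it does not rely on Mendelian errors at all.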

CONCLUSION

Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05a0/8063364/a6edc702d4ee/13040_2021_259_Fig1_HTML.jpg
