Estimating sequencing error rates using families.

Authors

Paskov Kelley, Jung Jae-Yoon, Chrisman Brianna, Stockham Nate T, Washington Peter, Varma Maya, Sun Min Woo, Wall Dennis P

Affiliations

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Department of Pediatrics (Systems Medicine), Stanford University, Stanford, CA, USA.

Publication information

BioData Min. 2021 Apr 23;14(1):27. doi: 10.1186/s13040-021-00259-6.

DOI:10.1186/s13040-021-00259-6

PMID:33892748

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8063364/

Abstract

BACKGROUND

As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.
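To make the idea of a Mendelian error concrete, the following is a minimal sketch of a trio consistency check. It is an illustration only, not the authors' implementation: it assumes biallelic autosomal sites with genotypes encoded as alternate-allele counts (0, 1, 2), and the function names `is_mendelian_consistent` and `count_mendelian_errors` are hypothetical.

```python
from itertools import product

# Alleles each parent can transmit, keyed by genotype (alt-allele count).
# 0 = reference allele, 1 = alternate allele.
# Assumption: biallelic autosomal sites, no missing genotypes.
_GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def is_mendelian_consistent(mother, father, child):
    """Return True if the child's genotype can be formed from one
    allele transmitted by each parent."""
    possible = {m + f for m, f in product(_GAMETES[mother], _GAMETES[father])}
    return child in possible


def count_mendelian_errors(trios):
    """Count (mother, father, child) genotype triples that violate
    Mendelian inheritance."""
    return sum(not is_mendelian_consistent(m, f, c) for m, f, c in trios)


if __name__ == "__main__":
    # Hypothetical genotype calls at four sites for one trio.
    trios = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
    print(count_mendelian_errors(trios))  # 2
```

Note that only some miscalls surface this way. If one parent is heterozygous, a child whose true genotype is 0 but who is called as 1 still passes the check, which is why Mendelian errors expose only a fraction of sequencing errors.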

RESULTS

We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
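As a reminder of the quantities involved, the sketch below computes precision and recall from true-positive, false-positive, and false-negative counts, and a simple discordance rate between the calls of two monozygotic twins. These are the standard textbook definitions, not the paper's estimation procedure, which derives its estimates from Mendelian errors rather than from a truth set. The function names and the example counts are hypothetical, and genotypes are assumed to be encoded as alternate-allele counts.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns NaN for a quantity whose denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def twin_discordance(calls_a, calls_b):
    """Fraction of sites where two monozygotic twins' genotype calls differ.
    Assumption: both lists cover the same sites in the same order. Since
    monozygotic twins share essentially the same genome, a discordant site
    suggests that at least one of the two calls is wrong."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call lists must have the same length")
    if not calls_a:
        raise ValueError("call lists must not be empty")
    mismatches = sum(a != b for a, b in zip(calls_a, calls_b))
    return mismatches / len(calls_a)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision_recall(tp=90, fp=10, fn=30))          # (0.9, 0.75)
    print(twin_discordance([0, 1, 2, 1], [0, 1, 1, 1]))   # 0.25
```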

CONCLUSION

Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05a0/8063364/a6edc702d4ee/13040_2021_259_Fig1_HTML.jpg

Similar articles

Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing.

Genome Med. 2013 Mar 27;5(3):28. doi: 10.1186/gm432. eCollection 2013.

Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers.

BMC Bioinformatics. 2017 Jan 3;18(1):8. doi: 10.1186/s12859-016-1417-7.

Archived neonatal dried blood spot samples can be used for accurate whole genome and exome-targeted next-generation sequencing.

Mol Genet Metab. 2013 Sep-Oct;110(1-2):65-72. doi: 10.1016/j.ymgme.2013.06.004. Epub 2013 Jun 13.

Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project.

Genomics. 2019 Jul;111(4):808-818. doi: 10.1016/j.ygeno.2018.05.004. Epub 2018 May 29.

VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering.

BMC Genomics. 2015 Oct 28;16:875. doi: 10.1186/s12864-015-2050-y.

PhredEM: a phred-score-informed genotype-calling approach for next-generation sequencing studies.

Genet Epidemiol. 2017 Jul;41(5):375-387. doi: 10.1002/gepi.22048. Epub 2017 May 31.

Impact of post-alignment processing in variant discovery from whole exome data.

BMC Bioinformatics. 2016 Oct 3;17(1):403. doi: 10.1186/s12859-016-1279-z.

Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment.

Sci Rep. 2022 Dec 13;12(1):21502. doi: 10.1038/s41598-022-26181-3.

BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity.

BMC Bioinformatics. 2014 Apr 12;15:104. doi: 10.1186/1471-2105-15-104.

Cited by

Leveraging new methods for comprehensive characterization of mitochondrial DNA in esophageal squamous cell carcinoma.

Genome Med. 2024 Apr 2;16(1):50. doi: 10.1186/s13073-024-01319-2.

Identifying crossovers and shared genetic material in whole genome sequencing data from families.

Genome Res. 2023 Oct;33(10):1747-1756. doi: 10.1101/gr.277172.122. Epub 2023 Oct 25.

Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity.

Genome Res. 2023 Oct;33(10):1734-1746. doi: 10.1101/gr.277175.122. Epub 2023 Oct 25.

Grave-to-cradle: human embryonic lineage tracing from the postmortem body.

Exp Mol Med. 2023 Jan;55(1):13-21. doi: 10.1038/s12276-022-00912-y. Epub 2023 Jan 4.

Transmission dynamics of human herpesvirus 6A, 6B and 7 from whole genome sequences of families.

Virol J. 2022 Dec 24;19(1):225. doi: 10.1186/s12985-022-01941-9.

The human "contaminome": bacterial, viral, and computational contamination in whole genome sequences from 1000 families.

Sci Rep. 2022 Jun 14;12(1):9863. doi: 10.1038/s41598-022-13269-z.

A Method for Localizing Non-Reference Sequences to the Human Genome.

Pac Symp Biocomput. 2022;27:313-324.
