Reducing Sanger confirmation testing through false positive prediction algorithms.

Affiliation

HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA.

Publication information

Genet Med. 2021 Jul;23(7):1255-1262. doi: 10.1038/s41436-021-01148-3. Epub 2021 Mar 25.

DOI:10.1038/s41436-021-01148-3

PMID:33767343

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8257489/

Abstract

PURPOSE

Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing.

METHODS

We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives.
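
The labeling step described above can be illustrated with a minimal sketch, which is not the STEVE implementation. It compares a list of called variants with a truth set and marks each call as a true or false positive. The tuple layout `(chrom, pos, ref, alt)` and the function name `label_calls` are assumptions made for illustration; real benchmarking also has to reconcile differing variant representations and restrict comparison to confident regions, which this sketch ignores.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def label_calls(calls: Iterable[Variant], truth: Set[Variant]) -> Dict[Variant, int]:
    """Label each call: 1 = false positive (absent from truth set), 0 = true positive."""
    return {variant: 0 if variant in truth else 1 for variant in calls}


# Toy data, invented for the example
truth_set = {("chr1", 1000, "A", "G"), ("chr1", 2000, "CT", "C")}
called = [
    ("chr1", 1000, "A", "G"),   # present in the truth set
    ("chr1", 1500, "T", "C"),   # absent from the truth set
    ("chr1", 2000, "CT", "C"),  # present in the truth set
]

labels = label_calls(called, truth_set)
n_false_positive = sum(labels.values())
```

Labels of this kind would then serve as the training target for a classifier that uses per-call quality metrics as features.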

RESULTS

After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertion/deletion variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%.
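
The trade-off reported above, capturing a fixed share of false positives while sending fewer calls to confirmation, can be sketched as a threshold choice on a model's false-positive score. This is a hedged illustration under stated assumptions, not the authors' procedure: the function names, the toy scores, and the rule "calls scoring at or above the threshold go to Sanger confirmation" are all hypothetical.

```python
import math
from typing import Sequence


def pick_threshold(fp_scores: Sequence[float], target_capture: float) -> float:
    """Return the highest threshold t such that at least `target_capture`
    of the known false positives have a score >= t."""
    if not fp_scores:
        raise ValueError("need at least one false positive score")
    ranked = sorted(fp_scores, reverse=True)
    needed = max(1, math.ceil(target_capture * len(ranked)))
    return ranked[needed - 1]


def confirmation_reduction(tp_scores: Sequence[float], threshold: float) -> float:
    """Fraction of true positive calls scoring below the threshold,
    i.e. calls that would skip confirmatory testing."""
    if not tp_scores:
        return 0.0
    skipped = sum(1 for score in tp_scores if score < threshold)
    return skipped / len(tp_scores)


# Toy scores, invented for the example (higher = more likely a false positive)
fp_scores = [0.9, 0.8, 0.7, 0.2]
tp_scores = [0.05, 0.1, 0.3, 0.75, 0.6]

threshold = pick_threshold(fp_scores, target_capture=0.75)
reduction = confirmation_reduction(tp_scores, threshold)
```

With these toy values, three of the four false positives score at or above the chosen threshold, and four of the five true positives fall below it and would skip confirmation. A stricter capture target such as 0.995 pushes the threshold lower and shrinks the reduction.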

CONCLUSION

Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.

Similar articles

Reducing false-positive incidental findings with ensemble genotyping and logistic regression based variant filtering methods.

Hum Mutat. 2014 Aug;35(8):936-44. doi: 10.1002/humu.22587. Epub 2014 Jun 24.

A universal algorithm for de novo decrypting of heterozygous indel sequences: a tool for personalized medicine.

Clin Chim Acta. 2008 Mar;389(1-2):7-13. doi: 10.1016/j.cca.2007.11.011. Epub 2007 Nov 23.

SICaRiO: short indel call filtering with boosting.

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa238.

Machine learning random forest for predicting oncosomatic variant NGS analysis.

Sci Rep. 2021 Nov 8;11(1):21820. doi: 10.1038/s41598-021-01253-y.

Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants.

Proc Natl Acad Sci U S A. 2015 Apr 28;112(17):5473-8. doi: 10.1073/pnas.1418631112. Epub 2015 Mar 31.

Assessing the necessity of confirmatory testing for exome-sequencing results in a clinical molecular diagnostic laboratory.

Genet Med. 2014 Jul;16(7):510-5. doi: 10.1038/gim.2013.183. Epub 2014 Jan 9.

Accurate indel prediction using paired-end short reads.

BMC Genomics. 2013 Feb 27;14:132. doi: 10.1186/1471-2164-14-132.

A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree.

Genome Res. 2017 Jan;27(1):157-164. doi: 10.1101/gr.210500.116. Epub 2016 Nov 30.

A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing.

BMC Genomics. 2018 Apr 17;19(1):263. doi: 10.1186/s12864-018-4659-0.

Cited by

Genomic Evaluation of AML-Main Techniques and Novel Approaches.

J Clin Med. 2025 Aug 11;14(16):5685. doi: 10.3390/jcm14165685.

Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation.

BMC Genomics. 2025 Aug 6;26(1):728. doi: 10.1186/s12864-025-11889-z.

Predicting high confidence ctDNA somatic variants with ensemble machine learning models.

Sci Rep. 2025 May 26;15(1):18384. doi: 10.1038/s41598-025-01326-2.

Novel variant alters splicing of in family with features of Loeys-Dietz syndrome.

Front Genet. 2024 Dec 16;15:1435734. doi: 10.3389/fgene.2024.1435734. eCollection 2024.

StratoMod: predicting sequencing and variant calling errors with interpretable machine learning.

Commun Biol. 2024 Oct 13;7(1):1316. doi: 10.1038/s42003-024-06981-1.

Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics.

Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad694.

Pharmacogenomic profile of actionable molecular variants related to drugs commonly used in anesthesia: WES analysis reveals new mutations.

Front Pharmacol. 2023 Mar 20;14:1047854. doi: 10.3389/fphar.2023.1047854. eCollection 2023.

Best practices for the interpretation and reporting of clinical whole genome sequencing.

NPJ Genom Med. 2022 Apr 8;7(1):27. doi: 10.1038/s41525-022-00295-z.

A study of elective genome sequencing and pharmacogenetic testing in an unselected population.

Mol Genet Genomic Med. 2021 Sep;9(9):e1766. doi: 10.1002/mgg3.1766. Epub 2021 Jul 27.
