Harnessing genotype and phenotype data for population-scale variant classification using large language models and Bayesian inference.

Author information

Manders Toby R, Tan Christopher A, Kobayashi Yuya, Wahl Alexander, Araya Carlos, Colavin Alexandre, Facio Flavia M, Metz Hillery, Reuter Jason, Frésard Laure, Padigepati Samskruthi R, Stafford David A, Nussbaum Robert L, Nykamp Keith

Affiliations

Labcorp Genetics Inc, 1400 16th Street, San Francisco, CA, 94103, USA.

Invitae Corporation, 1400 16th Street, San Francisco, CA, 94103, USA.

Publication information

Hum Genet. 2025 Apr 23. doi: 10.1007/s00439-025-02743-z.

DOI: 10.1007/s00439-025-02743-z

PMID: 40266329

Abstract

Variants of Uncertain Significance (VUS) in genetic testing for hereditary diseases burden patients and clinicians, yet clinical data that could reduce VUS are underutilized due to a lack of scalable strategies. We assessed whether a machine learning approach using genotype and phenotype data could improve variant classification and reduce VUS. In this cohort study of a multi-step machine learning approach, patient data from test requisition forms were used to distinguish patients with molecular diagnoses from controls ("patient score"). A generative Bayesian model then used patient scores and variant classifications to infer variant pathogenicity ("variant score"). The study included 3.5 million patients referred for clinical genetic testing across various conditions. Primary outcomes were model- and gene-level discrimination, classification performance, probabilistic calibration, and concordance with orthogonal pathogenicity measures. Integration into a semi-quantitative classification framework was based on posterior pathogenicity probabilities matching PPV ≥ 0.99/NPV ≥ 0.95 thresholds, followed by expert review. We generated 1,334 clinical variant models (CVMs); 595 showed high performance in both machine learning steps (AUROCpatient ≥ 0.8 and AUROCvariant ≥ 0.8) on held-out data. High-confidence predictions from these CVMs provided evidence for 5,362 VUS observed in 200,174 patients, representing 23.4% of all VUS observations in these genes. In 17 frequently tested genes, CVMs reclassified over 1,000 unique VUS, reducing VUS report rates by 9-49% per condition. In conclusion, a scalable machine learning approach using underutilized clinical data improved variant classification and reduced VUS.
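
The second step described in the abstract, inferring a variant's pathogenicity from the patient scores of its carriers, can be illustrated with a toy Bayesian update. The sketch below is not the authors' generative model: the Beta score distributions, the prior, the function names, and the posterior cutoffs are all hypothetical choices made for illustration. In the study, the PPV ≥ 0.99 and NPV ≥ 0.95 thresholds are reached through calibration and expert review rather than by fixed cutoffs like these.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def variant_posterior(patient_scores, prior=0.1,
                      path_params=(4.0, 2.0), benign_params=(2.0, 4.0)):
    """Posterior probability that a variant is pathogenic.

    Assumption (hypothetical): carriers of a pathogenic variant have patient
    scores drawn from Beta(*path_params), carriers of a benign variant from
    Beta(*benign_params), independently across carriers.
    """
    eps = 1e-6  # keep scores strictly inside (0, 1) so the logs are finite
    log_odds = math.log(prior / (1.0 - prior))
    for s in patient_scores:
        s = min(max(s, eps), 1.0 - eps)
        log_odds += beta_logpdf(s, *path_params) - beta_logpdf(s, *benign_params)
    return sigmoid(log_odds)


def classify(posterior, upper=0.99, lower=0.05):
    """Map a posterior to an evidence label using hypothetical cutoffs."""
    if posterior >= upper:
        return "evidence toward pathogenic"
    if posterior <= lower:
        return "evidence toward benign"
    return "no evidence applied"


if __name__ == "__main__":
    for scores in ([0.9, 0.8, 0.85], [0.5, 0.5], [0.1, 0.2, 0.15]):
        p = variant_posterior(scores)
        print(scores, round(p, 4), classify(p))
```

Under these assumptions, carriers with consistently high patient scores push the posterior toward 1, carriers with consistently low scores push it toward 0, and scores of 0.5 leave the prior unchanged because both assumed distributions have the same density there.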
