Leveraging cancer mutation data to inform the pathogenicity classification of germline missense variants.

Author information

Haque Bushra, Cheerie David, Pan Amy, Curtis Meredith, Nalpathamkalam Thomas, Nguyen Jimmy, Salhab Celine, Thiruvahindrapuram Bhooma, Zhang Jade, Couse Madeline, Hartley Taila, Morrow Michelle M, Price E Magda, Walker Susan, Malkin David, Roth Frederick P, Costain Gregory

Affiliations

Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada.

Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.

Publication information

PLoS Genet. 2025 Jan 6;21(1):e1011540. doi: 10.1371/journal.pgen.1011540. eCollection 2025 Jan.

DOI:10.1371/journal.pgen.1011540

PMID:39761285

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11737861/

Abstract

Innovative and easy-to-implement strategies are needed to improve the pathogenicity assessment of rare germline missense variants. Somatic cancer driver mutations identified through large-scale tumor sequencing studies often impact genes that are also associated with rare Mendelian disorders. The use of cancer mutation data to aid in the interpretation of germline missense variants, regardless of whether the gene is associated with a hereditary cancer predisposition syndrome or a non-cancer-related developmental disorder, has not been systematically assessed. We extracted putative cancer driver missense mutations from the Cancer Hotspots database and annotated them as germline variants, including presence/absence and classification in ClinVar. We trained two supervised learning models (logistic regression and random forest) to predict variant classifications of germline missense variants in ClinVar using Cancer Hotspot data (training dataset). The performance of each model was evaluated with an independent test dataset generated in part from searching public and private genome-wide sequencing datasets from ~1.5 million individuals. Of the 2,447 cancer mutations, 691 corresponding germline variants had been previously classified in ClinVar: 426 (61.6%) as likely pathogenic/pathogenic, 261 (37.8%) as uncertain significance, and 4 (0.6%) as likely benign/benign. The odds ratio for a likely pathogenic/pathogenic classification in ClinVar was 28.3 (95% confidence interval: 24.2-33.1, p < 0.001), compared with all other germline missense variants in the same 216 genes. Both supervised learning models showed high correlation with pathogenicity assessments in the training dataset. When applied to the test dataset, the logistic regression and random forest models achieved high values for the area under the precision-recall curve (0.847 and 0.829, respectively) and the area under the receiver-operating characteristic curve (0.821 and 0.774, respectively).
With the use of cancer and germline datasets and supervised learning techniques, our study shows that cancer mutation data can be leveraged to improve the interpretation of germline missense variation potentially causing rare Mendelian disorders.
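The proportions quoted in the abstract can be checked directly from the reported counts, and an odds ratio with a 95% confidence interval is conventionally derived from a 2x2 table. The following is a minimal Python sketch of both calculations. The function names `classification_shares` and `odds_ratio_wald_ci` are hypothetical, and the Wald (log-scale) interval is an assumption for illustration; the abstract does not state which interval method the authors used, nor does it report the four cell counts behind the odds ratio of 28.3, so the sketch does not attempt to reproduce that figure.

```python
import math

# Counts reported in the abstract for the 691 germline variants
# that already had a ClinVar classification.
CLINVAR_COUNTS = {
    "likely pathogenic/pathogenic": 426,
    "uncertain significance": 261,
    "likely benign/benign": 4,
}


def classification_shares(counts):
    """Return each category's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table (illustrative helper).

    a: hotspot variants classified likely pathogenic/pathogenic
    b: hotspot variants with any other classification
    c: non-hotspot variants classified likely pathogenic/pathogenic
    d: non-hotspot variants with any other classification
    All four counts must be positive. z=1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


shares = classification_shares(CLINVAR_COUNTS)
```

Here `shares` reproduces the 61.6%, 37.8%, and 0.6% figures from the counts 426, 261, and 4 out of 691. Applying `odds_ratio_wald_ci` to the study would require the comparison-group counts for the other germline missense variants in the same 216 genes, which are given in the full text rather than the abstract.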
