Current cancer driver variant predictors learn to recognize driver genes instead of functional variants.

Affiliations

ESAT-STADIUS, KU Leuven, Leuven, 3001, Belgium.

Università di Torino, Torino, 10123, Italy.

Publication information

BMC Biol. 2021 Jan 13;19(1):3. doi: 10.1186/s12915-020-00930-0.

DOI: 10.1186/s12915-020-00930-0

PMID: 33441128

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7807764/

Abstract

BACKGROUND

Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task.

RESULTS

In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect arises because, in these data sets, the driver variants map to a few driver genes, while the passenger variants are spread across thousands of genes, so that merely learning to recognize driver genes provides almost perfect predictions.
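The construction bias described above can be illustrated with a toy experiment. The following is a minimal sketch on synthetic data, not the authors' data sets or code: the gene names, the variant counts, and the helpers `make_biased_dataset`, `gene_only_scores`, and `auc` are all hypothetical. The predictor sees only the gene each variant belongs to, with no variant-level feature at all, yet it scores well because drivers cluster in a few genes.

```python
import random
from collections import defaultdict

def auc(scores, labels):
    """Probability that a random driver outscores a random passenger (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_biased_dataset(seed=0, n_driver_genes=5, n_passenger_genes=500):
    """Synthetic (gene, label) pairs: drivers sit in a few genes, passengers in many."""
    rng = random.Random(seed)
    data = []
    for g in range(n_driver_genes):
        data += [(f"DRV{g}", 1)] * 40   # assumed: 40 driver variants per driver gene
        data += [(f"DRV{g}", 0)] * 2    # assumed: a few passengers in driver genes
    for g in range(n_passenger_genes):
        data += [(f"PAS{g}", 0)] * rng.randint(1, 3)
    rng.shuffle(data)
    return data

def gene_only_scores(train, test):
    """Score each test variant by the driver fraction of its gene in the training split."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [n_variants, n_drivers]
    for gene, y in train:
        counts[gene][0] += 1
        counts[gene][1] += y
    prior = sum(y for _, y in train) / len(train)  # fallback for unseen genes
    return [counts[g][1] / counts[g][0] if g in counts else prior for g, _ in test]

data = make_biased_dataset()
half = len(data) // 2
train, test = data[:half], data[half:]
scores = gene_only_scores(train, test)
gene_only_auc = auc(scores, [y for _, y in test])
print(f"gene-only AUC on biased toy data: {gene_only_auc:.3f}")
```

Under these assumptions the gene-only score separates the classes almost perfectly, which is the kind of unwarranted optimism the abstract attributes to data set construction rather than to variant-level modeling.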

CONCLUSIONS

To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b84/7807764/872597e0da46/12915_2020_930_Fig1_HTML.jpg
