Essentiality, protein-protein interactions and evolutionary properties are key predictors for identifying cancer-associated genes using machine learning.

Authors

Safadi Amro, Lovell Simon C, Doig Andrew J

Affiliations

Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, M13 9PT, UK.

Division of Neuroscience, School of Biological Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, M13 9BL, UK.

Publication information

Sci Rep. 2024 Apr 22;14(1):9199. doi: 10.1038/s41598-023-44118-2.

DOI:10.1038/s41598-023-44118-2

PMID:38649399

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11035574/

Abstract

The distinctive nature of cancer as a disease prompts an exploration of the special characteristics exhibited by the genes implicated in it. Identifying cancer-associated genes and their characteristics is crucial both to furthering our understanding of the disease and to improving the likelihood that therapeutic drug targets succeed. However, cancer genes are being identified experimentally at a slow rate. Predictive analysis, through the building of accurate machine learning models, is a potentially useful approach to increasing the rate at which these genes and their characteristics are identified. Here, we investigated gene essentiality scores and found that they tend to be higher for cancer-associated genes than for other protein-coding human genes. We built a dataset of extended gene properties linked to essentiality and used it to train a machine-learning model; this model reached 89% accuracy and an Area Under the Curve (AUC) of > 0.85. The model showed that essentiality, evolution-related properties, and properties arising from protein-protein interaction networks are particularly effective in predicting cancer-associated genes. We were able to use the model to identify potential candidate genes that have not previously been linked to cancer. Prioritising genes that score highly by our methods could aid scientists in their cancer gene research.
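The two metrics quoted in the abstract, accuracy and AUC, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for illustration, and the AUC is computed with the standard pairwise (Mann-Whitney) definition rather than any particular library.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of genes whose thresholded score matches the true label.

    The 0.5 threshold is an assumption for this sketch.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via its pairwise definition.

    AUC equals the probability that a randomly chosen positive
    (cancer-associated) gene receives a higher score than a randomly
    chosen negative gene, with ties counted as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = cancer-associated, 0 = other gene.
# The scores stand in for a classifier's predicted probabilities.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(toy_labels, toy_scores):.4f}")
    print(f"AUC      = {roc_auc(toy_labels, toy_scores):.4f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarises how well the scores rank positives above negatives across all thresholds, which is why the two are usually reported together.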
