CDMPred: a tool for predicting cancer driver missense mutations with high-quality passenger mutations.

Author information

Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China.

School of Information Engineering, Huangshan University, Huangshan, Anhui, China.

Publication information

PeerJ. 2024 Sep 6;12:e17991. doi: 10.7717/peerj.17991. eCollection 2024.

Abstract

Most computational methods for predicting driver mutations are trained on curated positive samples, while their negative samples are typically derived from statistical methods or putative samples. How well these negative samples capture the diversity of passenger mutations remains to be determined. To address this, we curated a balanced dataset comprising driver mutations sourced from the COSMIC database and high-quality passenger mutations obtained from the Cancer Passenger Mutation database. We then encoded the distinctive features of these mutations. Using feature correlation analysis and feature selection based on the ensemble learning technique XGBoost, we developed a cancer driver missense mutation predictor called CDMPred. With the top 10 features and XGBoost, CDMPred achieved area under the receiver operating characteristic curve (AUC) values of 0.83 and 0.80 on the training and independent test sets, respectively. Furthermore, CDMPred outperformed existing state-of-the-art cancer-specific and general disease prediction methods, as measured by AUC and area under the precision-recall curve. Including high-quality passenger mutations in the training data proved advantageous for CDMPred's prediction performance. We anticipate that CDMPred will be a valuable tool for predicting cancer driver mutations and for furthering our understanding of personalized therapy.
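
The abstract reports performance as AUC and area under the precision-recall curve. As a reading aid only, the following minimal Python sketch shows how these two metrics can be computed from binary labels (1 = driver, 0 = passenger) and predicted scores. It is not the authors' code: the function names `roc_auc` and `average_precision` and the toy labels and scores are hypothetical, and average precision is used here as one common step-wise estimate of the area under the precision-recall curve.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values measured at the rank
    of each positive sample, with samples sorted by descending score.
    Tied scores are kept in input order, so ties can affect the result."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return precision_sum / hits


# Hypothetical toy data for illustration only (not from the paper).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

if __name__ == "__main__":
    print("ROC AUC:", roc_auc(toy_labels, toy_scores))
    print("Average precision:", average_precision(toy_labels, toy_scores))
```

The pairwise loop in `roc_auc` is quadratic in the number of samples, which keeps the definition visible; for real datasets a rank-based implementation from an established library would be the more practical choice.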

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1949/11382650/8130a70810fd/peerj-12-17991-g001.jpg
