Ayubi Erfan, Farashi Sajjad, Tapak Leili, Afshar Saeid
Cancer Research Center, Institute of Cancer, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran.
Neurophysiology Research Center, Institute of Neuroscience and Mental Health, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran.
Heliyon. 2024 Dec 24;11(1):e41443. doi: 10.1016/j.heliyon.2024.e41443. eCollection 2025 Jan 15.
The purpose of the current study was to develop and validate a biomarker-based prediction model for metastasis in patients with colorectal cancer (CRC).
Two datasets, GSE68468 and GSE41568, were retrieved from the Gene Expression Omnibus (GEO) database. In the GSE68468 dataset, key biomarkers were identified through a screening process comprising differential expression analysis, redundancy analysis, and recursive feature elimination. The prediction model was then developed and internally validated using five machine learning (ML) algorithms: lasso and elastic-net regularized generalized linear model (glmnet), k-nearest neighbors (kNN), support vector machine (SVM) with a radial basis function kernel, random forest (RF), and eXtreme Gradient Boosting (XGBoost). The algorithm with the highest internal accuracy was then externally validated on the GSE41568 dataset.
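The feature-screening idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the software used, so the sketch swaps in plain backward elimination scored by leave-one-out accuracy of a nearest-centroid classifier, as a simplified stand-in for recursive feature elimination. The toy matrix `X`, the labels `y`, and both function names are hypothetical and chosen only for illustration.

```python
def nearest_centroid_loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on a feature subset."""
    correct = 0
    for i in range(len(X)):
        # Build per-class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared Euclidean distance from the held-out sample to a centroid.
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sums, key=dist)
        correct += predicted == y[i]
    return correct / len(X)


def backward_eliminate(X, y, n_keep):
    """Repeatedly drop the feature whose removal leaves the best-scoring subset."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        drop = max(
            features,
            key=lambda f: nearest_centroid_loo_accuracy(
                X, y, [g for g in features if g != f]),
        )
        features.remove(drop)
    return features


# Hypothetical toy data: column 0 separates the classes, columns 1 and 2 are noise.
X = [
    [1.0, 5.0, 0.2], [1.2, 1.0, 0.9], [0.8, 4.0, 0.4], [1.1, 2.0, 0.7],
    [3.0, 4.5, 0.3], [3.2, 1.5, 0.8], [2.9, 5.5, 0.5], [3.1, 2.5, 0.6],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected = backward_eliminate(X, y, n_keep=1)
print(selected)
```

On this toy input the informative column is the one retained. A real analysis would score each candidate subset on held-out data with the chosen learner, rather than reuse one small sample for both selection and evaluation.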
Among the 22,283 genes registered in the GSE68468 dataset, the screening process identified 16 key genes, including […], and these genes were used to build the prediction model. On the internal validation dataset, the predictive performance of the five ML algorithms was as follows: RF (accuracy = 0.97, kappa = 0.91), XGBoost (0.93, 0.81), kNN (0.93, 0.81), glmnet (0.93, 0.82), and SVM (0.92, 0.80). The top five biomarkers were […] and […]. In the external validation dataset, the RF model achieved an accuracy of 0.97, a kappa value of 0.92, and an area under the curve (AUC) of 0.99.
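Accuracy and kappa are reported together because kappa discounts the agreement expected by chance, which matters when metastatic and non-metastatic samples are unevenly represented. The sketch below shows how both are derived from a confusion matrix. The counts are hypothetical and are not taken from the study, and the function names are illustrative.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def cohen_kappa(confusion):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = accuracy(confusion)
    # Chance agreement: product of row and column marginals for each class.
    expected = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n))
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: rows are true classes, columns are predicted classes.
balanced = [[45, 5], [5, 45]]
imbalanced = [[85, 5], [5, 5]]

print(accuracy(balanced), cohen_kappa(balanced))
print(accuracy(imbalanced), cohen_kappa(imbalanced))
```

Both hypothetical matrices have the same accuracy, but kappa is much lower for the imbalanced one, because a classifier that mostly predicts the majority class already agrees with the labels often by chance.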
Using ML algorithms, this study identified biomarkers that may help identify patients with CRC who are prone to metastasis.