Nezamabadi Farahani Leila, Kazemnejad Anoshirvan, Afrasiabi Mahlagha, Tapak Leili
Department of Biostatistics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran.
Cell J. 2025 Jun 8;26(12):688-699. doi: 10.22074/cellj.2025.2034704.1618.
This study aimed to develop a hybrid model, based on support vector regression (SVR), for variable selection in high-dimensional survival analysis, and to use it to identify prognostic biomarkers associated with survival in oral cancer (OC) patients from gene expression data.
In this retrospective cohort study, gene expression profiles (54,613 probes) from 97 patients in the GSE41613 dataset of the GEO repository were used. First, martingale residuals were obtained from a Cox regression without covariates and used as a pseudo-survival outcome. Then, particle swarm optimization (PSO) and a genetic algorithm (GA) were each combined with SVR to select features related to the pseudo-survival outcome. The concordance index (C-index), mean absolute error (MAE), mean squared error (MSE), and R-squared were used to evaluate the performance of the models built on the selected features. Functional enrichment analysis was performed using the DAVID database, and external validation used independent datasets (GSE9844, GSE75538, GSE37991, and GSE42743).
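The pseudo-survival outcome described above can be illustrated with a minimal sketch. It assumes that, for a Cox model without covariates, the Breslow baseline cumulative hazard reduces to the Nelson-Aalen estimator, so each martingale residual is the event indicator minus the cumulative hazard at that patient's follow-up time. The function name and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def martingale_residuals(times, events):
    """Martingale residuals of a covariate-free (null) Cox model.

    Assumption: with no covariates, the baseline cumulative hazard is the
    Nelson-Aalen estimate H(t) = sum over event times t_j <= t of d_j / n_j,
    where d_j is the number of events at t_j and n_j the number at risk.
    The residual for subject i is event_i - H(time_i).
    """
    # Number of observed events at each distinct time.
    deaths = Counter(t for t, e in zip(times, events) if e)

    cum_hazard = {}
    running = 0.0
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        running += deaths.get(t, 0) / at_risk
        cum_hazard[t] = running

    return [e - cum_hazard[t] for t, e in zip(times, events)]


# Hypothetical toy data: follow-up times and event indicators (1 = death).
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
residuals = martingale_residuals(times, events)
```

In this formulation the residuals sum to zero and never exceed 1, which is what makes them usable as a continuous regression target for SVR.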
The PSO-based method outperformed the GA-based method, achieving a smaller MAE (0.061) and MSE (0.005), with an R-squared of 0.99 and a C-index of 0.973, and selecting 291 probes from the 1069 screened. A protein-protein interaction (PPI) network was constructed, comprising 200 nodes and 120 edges. Eleven key genes with the highest degree were identified as significant biomarkers associated with OC survival.
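Ranking genes by degree in a PPI network can be sketched as follows. This is an illustrative example only: the edge list, the gene labels (`G1` to `G5`), and the function name are hypothetical placeholders, and the study's own network construction is not reproduced here.

```python
from collections import defaultdict


def top_degree_nodes(edges, k):
    """Return the k nodes with the highest degree in an undirected graph.

    Duplicate edges and self-loops are ignored; ties are broken by node name.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # skip self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)

    ranked = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    return [(n, len(neighbors[n])) for n in ranked[:k]]


# Hypothetical interaction list; ("G2", "G1") duplicates ("G1", "G2").
edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
    ("G2", "G3"), ("G4", "G5"), ("G2", "G1"),
]
hubs = top_degree_nodes(edges, 2)
```

Using a set of neighbors per node keeps repeated interaction records from inflating a gene's degree.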
The PSO-based hybrid model effectively improved SVR performance in survival prediction for OC patients and identified key prognostic biomarkers. Despite its promising results and validation on independent datasets, limitations in generalizability and signs of overfitting suggest the model is not yet ready for clinical use. Further studies with larger, diverse datasets are recommended.