Leitão Beatriz N, Veríssimo André, Carvalho Alexandra M, Vinga Susana
Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), Instituto Superior Técnico, Universidade de Lisboa, 1000-029 Lisbon, Portugal.
Instituto de Telecomunicações (IT-Lisboa), Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal.
Genes (Basel). 2025 Apr 23;16(5):473. doi: 10.3390/genes16050473.
Glioblastoma is a highly aggressive brain tumour with poor survival outcomes, highlighting the need for reliable prognostic models. Developing robust and interpretable prognostic signatures is critical for improving patient stratification and guiding therapy. This study explored the integration of machine learning feature selection with regularised Cox regression to construct prognostic gene signatures for glioblastoma patients.
We combined the Boruta algorithm and Random Survival Forests (RSFs) with regularised Cox regression, along with network-based regularisation techniques (HubCox and OrphanCox), to develop interpretable prognostic signatures for stratifying high- and low-risk glioblastoma patients. Using mRNA-seq and survival data from The Cancer Genome Atlas (TCGA), we developed predictive models following WHO-2021 glioma guidelines.
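As a rough illustration of what "regularised Cox regression" optimises, the sketch below evaluates a negative Cox partial log-likelihood with an elastic-net penalty, in plain Python. It is not the authors' implementation. The function name, the Breslow-style risk sets, and the `lam`/`alpha` parameterisation are assumptions chosen for illustration; `alpha=1` corresponds to LASSO and intermediate values to Elastic Net. The network-based penalties (HubCox and OrphanCox) are not reproduced here.

```python
import math

def penalised_cox_loss(beta, X, time, event, lam=0.1, alpha=0.5):
    """Negative Cox partial log-likelihood (scaled by 1/n) plus an elastic-net penalty.

    Hypothetical helper for illustration only.
    beta  : list of coefficients, one per gene/feature
    X     : list of rows, one row of feature values per patient
    time  : follow-up time per patient
    event : 1 if death was observed, 0 if censored
    """
    n = len(time)
    # Linear predictor (log relative risk) for each patient
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]

    loglik = 0.0
    for i in range(n):
        if not event[i]:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at time[i]
        risk_sum = sum(math.exp(eta[j]) for j in range(n) if time[j] >= time[i])
        loglik += eta[i] - math.log(risk_sum)

    l1 = sum(abs(b) for b in beta)   # LASSO part, drives coefficients to zero
    l2 = sum(b * b for b in beta)    # ridge part, shrinks correlated features together
    penalty = lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)
    return -loglik / n + penalty
```

A feature-selection step such as Boruta or RSF would, under this reading, simply reduce the columns of `X` before a loss of this kind is minimised over `beta`.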
Integrating Boruta or RSF with regularised Cox regression improved both performance and interpretability. Boruta increased the concordance index (C-index) by 0.030 for LASSO and by 0.013 for Elastic Net, while significantly reducing the number of features. RSF similarly enhanced performance and reduced the number of features. The genes Lysyl Oxidase Like 1 (LOXL1) and Insulin Like Growth Factor Binding Protein 6 (IGFBP6) were consistently selected and linked to glioma survival, emphasising their clinical significance. The network-based methods demonstrated superior survival probability prediction (lower Integrated Brier Score), although with lower C-index values, highlighting limitations in ranking survival times. To evaluate generalisability, external validation using the Chinese Glioma Genome Atlas (CGGA) confirmed that a multigene signature derived from the most consistently selected genes significantly stratified patients by risk.
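To make the C-index concrete, here is a minimal sketch of Harrell's concordance index in plain Python. The function name and the toy data are hypothetical, and the tie handling (half credit for tied risk scores) is one common convention rather than necessarily the one used in the study. Because the C-index depends only on how patients are ranked, a model can predict survival probabilities well (low Integrated Brier Score) and still score a lower C-index.

```python
def concordance_index(time, event, risk):
    """Harrell's C-index: fraction of comparable patient pairs ranked correctly.

    Hypothetical helper for illustration only.
    time  : follow-up time per patient
    event : 1 if death was observed, 0 if censored
    risk  : model risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier death has the higher risk score
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four patients, the third is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
scores = [4.0, 3.0, 2.0, 1.0]
c_original = concordance_index(times, events, scores)
# Any strictly increasing transform of the scores leaves the C-index unchanged.
c_rescaled = concordance_index(times, events, [10 * s + 5 for s in scores])
```

In this toy example `c_original` and `c_rescaled` are equal, which shows that the metric says nothing about how well calibrated the predicted survival probabilities are.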
This study underscored the utility of combining machine learning feature selection with survival analysis to enhance prognostic modelling while balancing predictive performance and interpretability.