Gammall Jurgita, Lai Alvina G
Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK.
Oracle Global Services Limited, London, UK.
Discov Oncol. 2025 May 12;16(1):735. doi: 10.1007/s12672-025-02523-1.
The growing burden of cancer and the recent surge in healthcare data availability call for new ways of analysing this multifactorial disease and improving patient outcomes. The aim of this study is to develop and evaluate prognostic cancer survival models across ten common cancer types based on a large patient sample. We compare the performance of different machine learning algorithms and assess the added value of genetic information in cancer prognosis. We also provide ways to improve model explainability, which is critical for model adoption in clinical practice. This study included data from 9977 patients with bladder, breast, colorectal, endometrial, glioma, leukaemia, lung, ovarian, prostate, and renal cancers. Genetic data collected through the 100,000 Genomes Project were linked with clinical and demographic data provided by the National Cancer Registration and Analysis Service, Hospital Episode Statistics and the Office for National Statistics. More than 500 prognostic features were assessed, and four machine learning algorithms were developed: Elastic Net Cox proportional hazards regression, random survival forest, gradient boosting survival and the DeepSurv neural network. Most models achieved good performance, with C-index values ranging from 60% in bladder cancer to 80% in glioma and an average C-index of 72% across all cancer types. The different machine learning methods achieved similar performance, with the DeepSurv model slightly underperforming the other methods. Adding genetic data improved performance in endometrial, glioma, ovarian and prostate cancers, showing its potential importance for cancer prognosis. Patient age, stage, grade, referral route, waiting times, pre-existing conditions, previous hospital utilisation, tumour mutational burden and mutations in the TP53 gene were among the most important features in cancer survival modelling.
By offering a comprehensive set of predictive models for cancer survival, this study fills a critical gap in our understanding of cancer prognosis and provides new tools for informing cancer treatment and consequently improving patient outcomes.
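For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is an illustration only, not the authors' evaluation code; the function name `concordance_index`, its argument names and the toy data are assumptions introduced here, and pairs with tied follow-up times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index (illustrative sketch, hypothetical helper).

    times       -- follow-up time for each patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: skip pairs with tied times
        # Let `a` be the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier death had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): four patients, one censored.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risk_scores = [0.9, 0.4, 0.1, 0.7]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a C-index of 72% means that in roughly 72% of comparable patient pairs the model assigned the higher risk to the patient who died earlier.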