Germer Sebastian, Rudolph Christiane, Labohm Louisa, Katalinic Alexander, Rath Natalie, Rausch Katharina, Holleczek Bernd, Handels Heinz
German Research Center for Artificial Intelligence (DFKI), Ratzeburger Allee 160, 23562 Lübeck, Germany.
Institute for Cancer Epidemiology, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany.
Int J Med Inform. 2024 Nov;191:105607. doi: 10.1016/j.ijmedinf.2024.105607. Epub 2024 Aug 26.
Survival analysis based on cancer registry data is of paramount importance for monitoring the effectiveness of health care. As new methods arise, the compendium of statistical tools applicable to cancer registry data grows. In recent years, machine learning approaches for survival analysis have been developed. The aim of this study is to compare the performance of the well-established Cox regression with that of novel machine learning approaches on a previously unused dataset.
The study is based on lung cancer data from the Schleswig-Holstein Cancer Registry. Four survival analysis models are compared: Cox Proportional Hazard Regression (CoxPH) as the most commonly used statistical model, as well as Random Survival Forests (RSF) and two neural network architectures based on the DeepSurv and TabNet approaches. The models are evaluated using the concordance index (C-I), the Brier score and the AUC-ROC score. In addition, to gain more insight into the decision process of the models, we identified the features with the greatest impact on patient survival using permutation feature importance scores and SHAP values.
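To make the main evaluation metric concrete, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration only, not the implementation used in the study, which is not described in the abstract. The function name `concordance_index` and the toy data are hypothetical, and pairs with tied observed times are skipped for simplicity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs that the model orders correctly.

    times       -- observed follow-up times
    events      -- 1/True if death was observed, 0/False if censored
    risk_scores -- model output; a higher score means a higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied observed times
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # the earlier subject is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier death received the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the second one is censored.
times = [5, 10, 12, 20]
events = [1, 0, 1, 1]
risk_scores = [0.9, 0.4, 0.1, 0.7]
print(concordance_index(times, events, risk_scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading C-I values such as those reported for CoxPH and RSF.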
On a dataset that includes the cancer stage as defined by the Union for International Cancer Control (UICC), the best performing model is CoxPH (C-I: 0.698±0.005), whereas on a dataset that includes the tumor size, lymph node and metastasis status (TNM), the best performing model is RSF (C-I: 0.703±0.004). The explainability metrics show that the models rely primarily on the combined UICC stage and the metastasis status, which is consistent with other studies.
The studied methods are highly relevant for epidemiological researchers to create more accurate survival models, which can help physicians make informed decisions about appropriate therapies and management of patients with lung cancer, ultimately improving survival and quality of life.