Susič David, Syed-Abdul Shabbir, Dovgan Erik, Jonnagaddala Jitendra, Gradišek Anton
Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova cesta 39, SI-1000 Ljubljana, Slovenia.
Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan.
Comput Methods Programs Biomed. 2023 Apr;231:107435. doi: 10.1016/j.cmpb.2023.107435. Epub 2023 Feb 21.
Colorectal cancer is a major health concern. It is now the third most common cancer and the fourth leading cause of cancer mortality worldwide. The aim of this study was to evaluate the performance of machine learning algorithms for predicting the survival of colorectal cancer patients 1 to 5 years after diagnosis, and to identify the most important predictor variables.
A sample of 1236 patients diagnosed with colorectal cancer, described by 118 predictor variables, was used. The outcome of interest was a binary variable indicating whether or not the patient survived the number of years in question. Twenty predictor variables were selected using their mutual information score with the outcome. We implemented 11 machine learning algorithms and evaluated their performance using 5×2-fold cross-validation with stratified folds and paired Student's t-tests. We compared the results with the Kaplan-Meier estimator and Cox proportional hazards regression.
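The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, substitutes synthetic data for the patient records, and shows only logistic regression out of the 11 algorithms. Placing the mutual-information selection inside the pipeline (so that it is refit on each training fold) and standardizing the inputs are choices made for this sketch; the abstract does not say how either was handled. The paired t-tests between algorithms are omitted.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the study sample
# (1236 patients, 118 predictors); the real data are not reproduced here.
X, y = make_classification(
    n_samples=1236, n_features=118, n_informative=20, random_state=0
)

# Keep the 20 predictors with the highest mutual information with the outcome,
# then fit logistic regression on them.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=20),
    LogisticRegression(max_iter=1000),
)

# 5x2-fold cross-validation: 2 stratified folds, repeated 5 times -> 10 AUC scores.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC: {auc_scores.mean():.3f} (SD {auc_scores.std(ddof=1):.3f})")
```

Each of the five repetitions reshuffles the data and splits it into two stratified halves, so every patient is used for testing exactly once per repetition.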
Using the 20 most important predictor variables for each of the survival years, the logistic regression algorithm achieved an area under the receiver operating characteristic curve of 0.850 (0.014 SD, 0.840-0.860 95% CI) for the 1-year, and 0.872 (0.014 SD, 0.861-0.882 95% CI) for the 5-year survival prediction. Using only the 5 most important predictor variables, the corresponding values were 0.793 (0.020 SD, 0.778-0.807 95% CI) and 0.794 (0.011 SD, 0.785-0.802 95% CI). The most important variables for 1-year prediction were number of R residual, M distant metastasis, overall stage, probable recurrence within 5 years, and tumour length, whereas for 5-year prediction the most important were probable recurrence within 5 years, R residual, M distant metastasis, number of positive lymph nodes, and palliative chemotherapy. Biomarkers did not appear among the top 20 most important variables. For all survival intervals, the survival probability given by the top model agreed with the Kaplan-Meier estimate, both within one standard deviation and within the 95% confidence interval.
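The abstract does not state how the 95% confidence intervals were computed. As a worked check, the sketch below assumes a Student's t interval over the 10 scores produced by 5×2-fold cross-validation (9 degrees of freedom); under that assumption the reported means and standard deviations reproduce the reported intervals to within about 0.001, which is the precision of the rounded inputs. The function name and constants are illustrative.

```python
import math

T_CRIT_DF9 = 2.262  # two-sided 95% Student's t critical value for 9 degrees of freedom
N_SCORES = 10       # assumed: 5 repetitions x 2 folds

def t_interval(mean, sd, n=N_SCORES, t_crit=T_CRIT_DF9):
    """Return (low, high) for mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# (label, mean AUC, SD, reported 95% CI) taken from the abstract.
reported = [
    ("1-year, 20 variables", 0.850, 0.014, (0.840, 0.860)),
    ("5-year, 20 variables", 0.872, 0.014, (0.861, 0.882)),
    ("1-year, 5 variables", 0.793, 0.020, (0.778, 0.807)),
    ("5-year, 5 variables", 0.794, 0.011, (0.785, 0.802)),
]

for label, mean, sd, ci in reported:
    low, high = t_interval(mean, sd)
    print(f"{label}: computed {low:.3f}-{high:.3f}, reported {ci[0]:.3f}-{ci[1]:.3f}")
```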
The findings suggest that machine learning algorithms can predict the survival probability of colorectal cancer patients and can be used to inform patients and to assist decision-making in clinical care management. In addition, this study identifies the most essential variables for estimating short- and long-term survival among patients with colorectal cancer.