Xiao Boao, Yang Min, Meng Yao, Wang Weimin, Chen Yuan, Yu Chenglong, Bai Longlong, Xiao Lishun, Chen Yansu
School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
Sci Rep. 2025 Jan 21;15(1):2701. doi: 10.1038/s41598-025-86872-5.
Colorectal cancer (CRC) is a prevalent malignant tumor that presents significant challenges to both public health and healthcare systems. The aim of this study was to develop a machine learning model, based on five years of clinical follow-up data from CRC patients, that accurately identifies individuals at risk of poor prognosis. The study included 411 CRC patients who underwent surgery at Yixing Hospital and completed follow-up. A modeling dataset containing 73 characteristic variables was established by collecting demographic information, clinical blood test indicators, histopathological results, and additional treatment-related information. Decision tree, random forest, support vector machine, and extreme gradient boosting (XGBoost) models were built on the features identified through recursive feature elimination (RFE). The Cox proportional hazards model was used as the baseline for model comparison. During model training, hyperparameters were optimized using grid search. Model performance was assessed using multiple metrics, including accuracy, F1 score, Brier score, sensitivity, specificity, positive predictive value, negative predictive value, the receiver operating characteristic curve, the calibration curve, and decision curve analysis. For the selected optimal model, the decision-making process was interpreted using the SHapley Additive exPlanations (SHAP) method. The optimal RFE-XGBoost model achieved an accuracy of 0.83 (95% CI 0.76-0.90), an F1 score of 0.81 (95% CI 0.72-0.88), and an area under the receiver operating characteristic curve of 0.89 (95% CI 0.82-0.94). Furthermore, the model exhibited superior calibration and clinical utility. SHAP analysis revealed that increased perioperative transfusion quantity, higher tumor AJCC stage, elevated carcinoembryonic antigen level, elevated carbohydrate antigen 19-9 (CA19-9) level, advanced age, and elevated carbohydrate antigen 125 (CA125) level were correlated with increased individual mortality risk. The RFE-XGBoost model demonstrated excellent performance in predicting CRC patient prognosis, and the application of the SHAP method bolstered the model's credibility and utility.
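The workflow described in the abstract (RFE-based feature selection, a boosted-tree classifier tuned by grid search, and evaluation with accuracy, F1 score, Brier score, and AUC) can be illustrated with a minimal sketch. It is not the authors' code. It assumes synthetic data with the same shape as the study cohort (411 samples, 73 features), uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, and picks the number of retained features, the hyperparameter grid, and the 0.5 decision threshold purely for illustration. The SHAP interpretation step is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, brier_score_loss, f1_score,
                             roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the cohort: 411 samples, 73 features (assumption).
X, y = make_classification(n_samples=411, n_features=73, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# RFE sits inside the pipeline so feature selection is repeated within each
# cross-validation fold. Keeping 15 features is an illustrative choice.
pipeline = Pipeline([
    ("rfe", RFE(GradientBoostingClassifier(n_estimators=30, random_state=0),
                n_features_to_select=15, step=10)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Small hypothetical grid; the study's actual search space is not given.
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [2, 3]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Evaluate the refit best pipeline on the held-out split.
proba = search.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "brier": brier_score_loss(y_test, proba),
    "auc": roc_auc_score(y_test, proba),
}

# Boolean mask over the 73 input columns marking the features RFE kept.
selected = search.best_estimator_.named_steps["rfe"].support_

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print("features kept:", int(selected.sum()))
```

Placing RFE inside the pipeline keeps the held-out folds out of the feature-selection step, which avoids an optimistic bias in the cross-validated scores. Whether the study performed selection this way is not stated in the abstract.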