Explainable Machine Learning Models for Colorectal Cancer Prediction Using Clinical Laboratory Data.

Authors

Li Rui, Hao Xiaoyan, Diao Yanjun, Yang Liu, Liu Jiayun

Affiliation

Department of Clinical Laboratory Medicine, Xijing Hospital, Air Force Medical University, Xi'an, China.

Publication

Cancer Control. 2025 Jan-Dec;32:10732748251336417. doi: 10.1177/10732748251336417. Epub 2025 May 7.

Introduction

Early diagnosis of colorectal cancer (CRC) poses a significant clinical challenge. This study aims to develop machine learning (ML) models for CRC risk prediction using clinical laboratory data.

Methods

This retrospective, single-center study analyzed laboratory examination data from healthy controls (HC), polyp patients (Polyp), and CRC patients collected between 2013 and 2023. Five ML algorithms, namely adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), decision tree (DT), logistic regression (LR), and random forest (RF), were used for three classification tasks: HC vs Polyp vs CRC, HC vs CRC, and Polyp vs CRC.

Results

This study included 31 539 subjects: 11 793 HCs, 10 125 polyp patients, and 9621 CRC patients. The XGBoost model achieved the highest AUCs, 0.966 for differentiating HC from CRC and 0.881 for differentiating Polyp from CRC, outperforming carcinoembryonic antigen (CEA) and fecal occult blood testing (FOBT). The model could also identify CEA-negative or FOBT-negative CRC patients. Incorporating stool miR-92a detection into the model further improved diagnostic performance. Shapley additive explanations (SHAP) plots indicated that FOBT, CEA, lymphocyte percentage (LYMPH%), and hematocrit (HCT) were the features contributing most to CRC diagnosis. Additionally, a computational tool for predicting CRC risk based on the optimal model was developed, designed for researchers with programming experience.

Conclusion

Five ML models for CRC diagnosis, based on ten routine laboratory test items, were developed, achieving higher diagnostic accuracy than traditional CRC biomarkers. The diagnostic capabilities of these ML models can be further enhanced by including stool miR-92a levels.
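The Results compare models and single biomarkers by AUC (area under the ROC curve). As a minimal sketch of what that metric measures, and not the authors' code, the snippet below computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and all values are illustrative assumptions; the toy scores are made up and have no relation to the study data.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC via pairwise comparison (Mann-Whitney formulation).

    labels: iterable of 0 (negative) or 1 (positive)
    scores: iterable of numeric scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up toy values (NOT study data): 4 negatives followed by 4 positives.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical continuous risk scores from a classifier.
model_scores = [0.05, 0.20, 0.30, 0.60, 0.40, 0.70, 0.85, 0.95]
# Hypothetical binary marker result (1 = positive test), e.g. a yes/no assay.
binary_marker = [0, 0, 1, 0, 0, 1, 1, 1]

model_auc = roc_auc(labels, model_scores)
marker_auc = roc_auc(labels, binary_marker)
print(f"model AUC:  {model_auc:.4f}")
print(f"marker AUC: {marker_auc:.4f}")
```

On these toy inputs the continuous score ranks more positive/negative pairs correctly than the binary marker, which is the kind of comparison an AUC summarizes. The pairwise loop is quadratic in sample size, so a rank-based implementation would be preferable for cohorts of the size reported here.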