Ba Qinwen, Yuan Xu, Wang Yun, Shen Na, Xie Huaping, Lu Yanjun
Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China.
Department of Gastroenterology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China.
Biomedicines. 2024 Aug 27;12(9):1955. doi: 10.3390/biomedicines12091955.
Colorectal polyps are the main source of precancerous lesions in colorectal cancer. To increase early tumor diagnosis and improve screening, we aimed to develop a simple, non-invasive diagnostic prediction model for colorectal polyps based on machine learning (ML) and readily accessible health examination records.
We conducted a single-center, observational, retrospective study in China. The derivation cohort, consisting of 5426 individuals who underwent colonoscopy screening from January 2021 to January 2024, was split into a training set (cohort 1) and a validation set (cohort 2). The variables considered in this study included demographic data, vital signs, and laboratory results recorded in health examination records. With features selected by univariate analysis and Lasso regression analysis, nine machine learning methods were used to develop a colorectal polyp diagnostic model. Several evaluation indexes, including the area under the receiver-operating-characteristic curve (AUC), were used to compare predictive performance. The SHapley Additive exPlanations (SHAP) method was used to rank feature importance and explain the final model.
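As a reading aid for the AUC used to compare the models, the following is a minimal sketch of one standard way to compute it: the fraction of (polyp, no-polyp) pairs in which the polyp case receives the higher predicted score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a random negative.

    labels: iterable of 0/1 (1 = polyp found at colonoscopy, hypothetical coding)
    scores: iterable of model scores, higher meaning more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): three of the four positive/negative pairs
# are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

On this scale, 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of cases, which is fine for illustration; a sort-based implementation would be preferable for a cohort of several thousand individuals.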
Fourteen independent predictors were identified as the most valuable features for building the models. The adaptive boosting (AdaBoost) model performed best among the nine ML models in cohort 1, with accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC (95% CI) of 0.632 (0.618-0.646), 0.635 (0.550-0.721), 0.674 (0.591-0.758), 0.593 (0.576-0.611), 0.673 (0.654-0.691), 0.608 (0.560-0.655), and 0.687 (0.626-0.749), respectively. The final model achieved an AUC of 0.675 in cohort 2. Additionally, the precision-recall (PR) curve for the AdaBoost model reached the highest area under the PR curve (AUPR), 0.648, placing it nearest to the upper right corner. SHAP analysis provided visual explanations, reaffirming the critical factors associated with the risk of colorectal polyps in the asymptomatic population.
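The threshold-based indexes reported above all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the function name `screening_metrics` and the counts in the example are hypothetical and are not the study's confusion matrix.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: polyp cases predicted positive     fn: polyp cases predicted negative
    tn: polyp-free predicted negative      fp: polyp-free predicted positive
    """
    sensitivity = tp / (tp + fn)            # recall among true polyp cases
    specificity = tn / (tn + fp)            # recall among polyp-free cases
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    npv = tn / (tn + fn)                    # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts for 100 screened individuals (illustration only).
for name, value in screening_metrics(tp=30, fp=10, tn=50, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common polyps are in the screened population, so values from one cohort may not carry over to a population with a different prevalence.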
This study integrated clinical and laboratory indicators with machine learning techniques to establish a predictive model for colorectal polyps, offering a non-invasive, cost-effective screening strategy for asymptomatic individuals and guiding decisions about further examination and treatment.