Population-based colorectal cancer risk prediction using a SHAP-enhanced LightGBM model.

Author information

Du Guinian, Lv Hui, Liang Yishan, Zhang Jingyue, Huang Qiaoling, Xie Guiming, Wu Xian, Zeng Hao, Wu Lijuan, Ye Jianbo, Xie Wentan, Li Xia, Sun Yifan

Affiliations

Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People's Hospital, Guigang, Guangxi, China.

Department of Laboratory Medicine, The People Hospital of Laibin, Laibin, Guangxi, China.

Publication information

Front Oncol. 2025 Jul 17;15:1575844. doi: 10.3389/fonc.2025.1575844. eCollection 2025.

Abstract

BACKGROUND

Colorectal cancer (CRC) is one of the most common cancers worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.

METHODS

We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People's Hospital (2020-2024) for model training and internal validation, and 463 patients from Laibin City People's Hospital for external validation. Seven ML algorithms were systematically compared, and Light Gradient Boosting Machine (LightGBM) was selected as the best-performing framework. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC), calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) was used for feature interpretation.
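To make the evaluation metrics named above concrete, here is a minimal pure-Python sketch of AUROC, the Brier score, and the net benefit used in decision curve analysis. It is not the authors' code: the function names and the toy labels and probabilities are assumptions made up for illustration, and in practice an established library would normally be used instead.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC as the chance that a random case scores above a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit at one threshold probability (one point on a decision curve).

    NB = TP/n - FP/n * pt / (1 - pt), where pt is the threshold.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Toy data, invented for illustration only (1 = CRC, 0 = control).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.1, 0.6]

if __name__ == "__main__":
    print("AUROC:", auroc(y_true, y_prob))
    print("Brier score:", brier_score(y_true, y_prob))
    print("Net benefit at 0.5:", net_benefit(y_true, y_prob, 0.5))
```

A lower Brier score indicates better-calibrated and sharper probabilities, while a decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing against the treat-all and treat-none strategies.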

RESULTS

The LightGBM model demonstrated exceptional discrimination, with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves showed close agreement between predicted and observed outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) and CA19-9 (mean SHAP value = 0.198) as the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A web-based risk calculator suitable for clinical use was developed for real-time probability estimation.
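The feature ranking reported above can be illustrated with a small sketch. It assumes that the "mean SHAP value" refers to the mean absolute SHAP value per feature across patients, which is the usual convention in SHAP summary plots; the abstract does not confirm this. The helper names and the per-patient SHAP values below are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def mean_abs_shap(shap_rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the absolute SHAP value of each feature over all samples."""
    if not shap_rows:
        raise ValueError("need at least one row of SHAP values")
    n = len(shap_rows)
    return {
        feature: sum(abs(row[feature]) for row in shap_rows) / n
        for feature in shap_rows[0]
    }


def rank_features(importance: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort features from most to least important."""
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-patient SHAP values (one dict per patient), for illustration.
toy_shap = [
    {"age": 0.30, "CA19-9": -0.10, "HGB": 0.05},
    {"age": -0.20, "CA19-9": 0.25, "HGB": -0.05},
    {"age": 0.10, "CA19-9": 0.15, "HGB": 0.02},
]

importance = mean_abs_shap(toy_shap)
ranking = rank_features(importance)

if __name__ == "__main__":
    for name, value in ranking:
        print(f"{name}: {value:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes risk up for some patients and down for others would otherwise average out toward zero and look unimportant.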

CONCLUSIONS

Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, effectively bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ae/12310463/f35e42b09cd4/fonc-15-1575844-g001.jpg
