Peduzzi Giulia, Felici Alessio, Pellungrini Roberto, Campa Daniele
Department of Biology, University of Pisa, Via Luca Ghini, 13 - 56126, Pisa, Italy.
Classe di scienze, Scuola Normale Superiore, Piazza dei Cavalieri, 7 - 56126, Pisa, Italy.
Dig Liver Dis. 2025 Apr;57(4):915-922. doi: 10.1016/j.dld.2024.11.010. Epub 2024 Dec 3.
Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetic factors, and have modest accuracy.
This study aimed to develop an interpretable PDAC risk prediction model based on machine learning (ML).
Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest and Random Forest), built on 56 exposome variables and a polygenic risk score (PRS), were tested in 654 PDAC cases and 1,308 controls from the UK Biobank. Additionally, SHapley Additive exPlanation (SHAP) and Global model Interpretation via Recursive Partitioning (Girp) were employed to explain the models.
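SHAP attributes a single prediction to the input features using Shapley values. As a minimal sketch of that idea, the exact Shapley computation can be written out for a tiny model. Everything below is an illustrative assumption rather than the study's method: `toy_risk` is a hypothetical linear score, the three features and their coefficients are invented, and the baseline values stand in for a reference individual. Real analyses would normally rely on an approximation from a dedicated library rather than this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):  # size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Hypothetical toy risk score (NOT the model from the study).
def toy_risk(p):
    return 0.02 * p["age"] + 0.5 * p["prs"] + 1.0 * p["pancreatitis"]


# Invented sample and reference individual, for illustration only.
sample = {"age": 70, "prs": 1.2, "pancreatitis": 1}
baseline = {"age": 55, "prs": 0.0, "pancreatitis": 0}

phi = shapley_values(toy_risk, sample, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the sample's score and the baseline's score, which is the property that makes per-feature attributions comparable across features.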
All models performed similarly, but CatBoost achieved the highest recall (77.10%). SHAP highlighted age and the PRS as the primary contributors across all models. Girp derived rules to discriminate cases from controls, and most of these rules involved age, the PRS, and pancreatitis.
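Recall, the metric used above to rank the models, is the fraction of true cases that a model correctly flags: TP / (TP + FN). A minimal sketch follows, with made-up labels that are not taken from the study; `recall_score` here is a hypothetical helper written for this example, not a library call.

```python
def recall_score(y_true, y_pred):
    """Recall = TP / (TP + FN), with 1 = case and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no true cases")
    return tp / (tp + fn)


# Invented labels for illustration: 4 cases, 3 of them detected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

example_recall = recall_score(y_true, y_pred)
print(f"recall = {example_recall:.2%}")
```

Because recall ignores false positives, a model chosen on recall favors catching cases over avoiding false alarms, which is a common preference when screening for a high-mortality disease.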
The predictive models tested performed well, indicating their potential for clinical application in the near future. As the explainers showed, the PRS plays a key role in identifying high-risk individuals.