Park Bomi, Kim Chung Ho, Jun Jae Kwan, Suh Mina, Choi Kui Son, Choi Il Ju, Oh Hyun Jin
Department of Preventive Medicine, College of Medicine, Chung-Ang University, Seoul, Korea.
National Cancer Control Institute, National Cancer Center, Goyang, Korea.
Cancer Res Treat. 2024 Dec 16. doi: 10.4143/crt.2024.843.
Gastric cancer (GC) prediction models could improve early detection by identifying high-risk individuals, supporting personalized risk-based screening, and optimizing the allocation of healthcare resources.
In this study, we developed a machine learning-based GC prediction model using data from the Korean National Health Insurance Service. The cohort comprised 10,515,949 adults with no prior GC diagnosis who underwent GC screening during 2013-2014 and were followed up for at least five years. The cohort was divided into training and test datasets at an 8:2 ratio, and class imbalance was mitigated through random oversampling.
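The split and resampling steps described above can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the authors' pipeline: the function names `split_8_to_2` and `random_oversample`, the toy 5% event rate, and the choice to oversample only the training portion are all assumptions made for the example.

```python
import random
from collections import Counter


def split_8_to_2(rows, seed=0):
    """Shuffle the rows and split them into 80% training and 20% test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for features, label in rows:
        by_label[label].append((features, label))
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        return rows[:]  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy cohort (hypothetical): 1,000 people, 5% of whom develop the outcome.
rows = [((i,), 1 if i % 20 == 0 else 0) for i in range(1000)]

train, test = split_8_to_2(rows)
# Assumption: only the training portion is oversampled, so the test set
# keeps the original class distribution.
balanced_train = random_oversample(train)

print(Counter(label for _, label in train))
print(Counter(label for _, label in balanced_train))
```

Leaving the test set untouched keeps the evaluation on data that reflects the original event rate, which is a common design choice when oversampling.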
Among the models evaluated, logistic regression demonstrated the highest predictive performance, with an area under the receiver operating characteristic curve (AUC) of 0.708, comparable to the AUC obtained in external validation (0.669). The results were robust to missing data imputation and variable selection. The SHapley Additive exPlanations (SHAP) algorithm enhanced the explainability of the model, identifying advancing age, male sex, Helicobacter pylori infection, current smoking, and a family history of GC as key predictors of elevated risk.
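To make the two quantities above concrete, the sketch below computes an AUC as the probability that a randomly chosen case is scored higher than a randomly chosen non-case, and computes SHAP-style contributions for a logistic regression. For a linear model with features treated as independent, the SHAP value of feature j on the log-odds scale reduces to w_j (x_j - mean_j). The coefficients, feature means, and the three features used here are hypothetical placeholders, not values from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the number of rows, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def log_odds(weights, bias, x):
    """Linear predictor of a logistic regression."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights, x, feature_means):
    """Per-feature contribution to the log-odds, relative to the average person."""
    return [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]


# Small ranking example: three of the four case/non-case pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)

# Hypothetical model with features: age (years), male (0/1), current smoker (0/1).
weights = [0.05, 0.6, 0.4]
bias = -6.0
feature_means = [55.0, 0.5, 0.2]
person = [70.0, 1.0, 1.0]

contributions = linear_shap(weights, person, feature_means)
print(contributions)
```

The contributions sum to the difference between this person's log-odds and the log-odds of the average feature vector, which is the additivity property that lets SHAP rank predictors by their influence on an individual prediction.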
This predictive model could support the early identification of individuals at elevated risk for gastric cancer, enabling targeted preventive strategies. Furthermore, its reliance on noninvasive and cost-effective predictors enhances its clinical utility and supports its potential application in routine healthcare settings.