Sahat Oraya, Kamsa-Ard Supot, Lim Apiradee, Kamsa-Ard Siriporn, Garcia-Constantino Matias, Ekerete Idongesit
Student of Doctor of Public Health Program, Faculty of Public Health, Khon Kaen University, Khon Kaen, Thailand.
Department of Epidemiology and Biostatistics, Faculty of Public Health, Khon Kaen University, Khon Kaen, Thailand.
BMC Public Health. 2025 Jun 7;25(1):2137. doi: 10.1186/s12889-025-23119-y.
Cholangiocarcinoma (CCA) poses a significant public health challenge in Thailand, with notably high incidence rates. This study aimed to compare the performance of spatial prediction models using Machine Learning techniques to analyze the occurrence of CCA across Thailand.
This retrospective cohort study analyzed CCA cases from four population-based cancer registries in Thailand, diagnosed between January 1, 2012, and December 31, 2021. The study employed Machine Learning models (Linear Regression, Random Forest, Neural Network, and Extreme Gradient Boosting (XGBoost)) to predict Age-Standardized Rates (ASR) of CCA based on spatial variables. Model performance was evaluated using Root Mean Square Error (RMSE) and R² with 70:30 train-test validation.
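The evaluation scheme described above (a 70:30 train-test split scored with RMSE and R²) can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the data are synthetic, the single predictor `elevation` and its relationship to ASR are invented for the example, and a one-variable least-squares fit stands in for the four models compared in the study.

```python
import math
import random

def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def split_70_30(rows, seed=0):
    """Shuffle the rows and return (train, test) in a 70:30 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Synthetic (elevation, ASR) pairs; every number here is made up.
rng = random.Random(42)
rows = []
for _ in range(200):
    elevation = rng.uniform(0.0, 500.0)
    asr = 15.0 - 0.02 * elevation + rng.gauss(0.0, 1.5)
    rows.append((elevation, asr))

train, test = split_70_30(rows)
slope, intercept = fit_simple_ols([r[0] for r in train], [r[1] for r in train])

def predict(x):
    return intercept + slope * x

for name, part in (("train", train), ("test", test)):
    observed = [r[1] for r in part]
    predicted = [predict(r[0]) for r in part]
    print(f"{name}: RMSE={rmse(observed, predicted):.3f}, "
          f"R2={r_squared(observed, predicted):.3f}")
```

Scoring the training and test partitions separately, as in the loop at the end, is what makes a gap between the two (a sign of overfitting) visible.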
The study included 6,379 CCA cases, with a male predominance (4,075 cases; 63.9%) and a mean age of 66.2 years (standard deviation = 11.1 years). The northeastern region accounted for most of the cases (3,898 cases; 61.1%). The overall ASR of CCA was 8.9 per 100,000 person-years (95% CI: 8.7 to 9.2), with the northeastern region showing the highest incidence (ASR = 13.4 per 100,000 person-years; 95% CI: 12.9 to 13.8). In the overall dataset, the Random Forest model outperformed the other models in both the training (R² = 72.07%) and testing datasets (R² = 71.66%). Model performance varied by region: Random Forest performed best in the northern and northeastern regions, while XGBoost performed best in the central and southern regions. The most important spatial predictors for CCA were elevation and distance from water sources.
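For readers unfamiliar with how an ASR per 100,000 person-years is formed, the sketch below shows direct age standardization: a weighted sum of age-specific rates using a standard population's age weights. The abstract does not state which standardization method or standard population was used, so the method is an assumption here, and the function name `age_standardized_rate`, the age bands, and all numbers are hypothetical.

```python
def age_standardized_rate(cases, person_years, std_weights, per=100_000):
    """Directly standardized rate: sum of age-specific rates times
    the standard population's share of each age band."""
    if not (len(cases) == len(person_years) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age band")
    total_weight = sum(std_weights)
    rate = 0.0
    for c, py, w in zip(cases, person_years, std_weights):
        age_specific_rate = c / py              # cases per person-year
        rate += age_specific_rate * (w / total_weight)
    return rate * per

# Hypothetical three-band example (made-up counts and weights).
cases = [5, 40, 120]                     # cases in each age band
person_years = [400_000, 300_000, 150_000]
std_weights = [50, 35, 15]               # standard population shares

print(f"ASR = {age_standardized_rate(cases, person_years, std_weights):.2f} "
      "per 100,000 person-years")
```

Because the weights come from a fixed standard population rather than from each region's own age structure, rates computed this way can be compared across regions with different age distributions.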
The Random Forest model demonstrated the highest efficiency in predicting CCA incidence rates in Thailand, though predictive performance varied across regions. Spatial factors effectively predicted ASR of CCA, providing valuable insights for national-level disease surveillance and targeted public health interventions. These findings support the development of region-specific approaches for CCA control using spatial epidemiology and machine learning techniques.