Cheng Yinlin, Gu Kuiying, Ji Weidong, Hu Zhensheng, Yang Yining, Zhou Yi
Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China.
School of Public Health, Soochow University, Suzhou, China.
J Med Internet Res. 2025 Mar 12;27:e68442. doi: 10.2196/68442.
Hypertension is a major global health issue and a significant modifiable risk factor for cardiovascular diseases, and its high prevalence imposes a substantial socioeconomic burden. In China, hypertension is even more prevalent among populations living near desert regions because of their unique environmental and lifestyle conditions. This exacerbates the disease burden in these areas and underscores the urgent need for effective early detection and intervention strategies.
This study aims to develop, calibrate, and prospectively validate a 2-year hypertension risk prediction model by using large-scale health examination data collected from populations residing in 4 regions surrounding the Taklamakan Desert of northwest China.
We retrospectively analyzed the health examination data of 1,038,170 adults (2019-2021) and prospectively validated our findings in a separate cohort of 961,519 adults (2021-2023). Data included demographics, lifestyle factors, physical examinations, and laboratory measurements. Feature selection combined recursive feature elimination with cross-validation based on a light gradient-boosting machine (LightGBM) with the Least Absolute Shrinkage and Selection Operator (LASSO), yielding 24 key predictors. Multiple machine learning (logistic regression, random forest, extreme gradient boosting, light gradient-boosting machine) and deep learning (Feature Tokenizer + Transformer, SAINT) models were trained with Bayesian hyperparameter optimization.
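To illustrate the LASSO half of the feature selection step, the following is a minimal sketch of L1-penalized selection by coordinate descent on a toy dataset. It is not the study's pipeline: the abstract does not say how LASSO was configured or how its output was combined with the LightGBM-based elimination. The squared-error loss, the penalty `alpha`, and the names `feature_a`, `feature_b`, and `feature_c` are all assumptions made for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept.

    Assumes the columns of X and the target y are already centered.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Hypothetical toy data: three centered, mutually orthogonal features.
names = ["feature_a", "feature_b", "feature_c"]
col_a = [1, -1, 1, -1, 1, -1, 1, -1]
col_b = [1, 1, -1, -1, 1, 1, -1, -1]
col_c = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(col_a, col_b, col_c)]
# The outcome depends strongly on feature_a and only weakly on feature_b.
y = [2.0 * a + 0.05 * b for a, b in zip(col_a, col_b)]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [name for name, w in zip(names, weights) if w != 0.0]
print(dict(zip(names, weights)))
print("selected:", selected)
```

The L1 penalty sets weak coefficients exactly to zero, which is what makes LASSO usable as a feature selector; in this toy case only the strongly related feature keeps a nonzero weight.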
Over a 2-year follow-up, 15.20% (157,766/1,038,170) of the participants in the retrospective cohort and 10.51% (101,077/961,519) in the prospective cohort developed hypertension. Among the models developed, the CatBoost model demonstrated the best performance, achieving area under the curve (AUC) values of 0.888 (95% CI 0.886-0.889) in the retrospective cohort and 0.803 (95% CI 0.801-0.804) in the prospective cohort. Calibration via isotonic regression improved the model's probability estimates, with Brier scores of 0.090 (95% CI 0.089-0.091) and 0.102 (95% CI 0.101-0.103) in the internal validation and prospective cohorts, respectively. Participants were ranked by the positive predictive value calculated using the calibrated model and stratified into 4 risk categories (low, medium, high, and very high), with the very high group exhibiting a 41.08% (5741/13,975) hypertension incidence over 2 years. Age, BMI, and socioeconomic factors were identified as significant predictors of hypertension.
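The calibration and stratification steps described above can be sketched in a few lines. The example below is a simplified illustration, not the authors' implementation. It fits an isotonic (non-decreasing step) mapping with the pool-adjacent-violators algorithm, compares Brier scores before and after calibration, and assigns a risk category. The toy scores and labels are invented, the step-function prediction is a simplification, and the cut-offs in `risk_category` are hypothetical, since the abstract does not report the thresholds used.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (upper_bounds, values): values[i] applies to scores up to upper_bounds[i].
    """
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge tied scores and any adjacent pair whose means decrease.
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    upper_bounds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return upper_bounds, values


def predict_isotonic(model, score):
    """Look up the calibrated probability for one raw score."""
    upper_bounds, values = model
    idx = bisect_left(upper_bounds, score)
    return values[min(idx, len(values) - 1)]


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def risk_category(prob, cutoffs=(0.05, 0.15, 0.30)):
    """Map a calibrated probability to one of 4 groups (hypothetical cut-offs)."""
    names = ("low", "medium", "high", "very high")
    return names[sum(prob >= c for c in cutoffs)]


# Invented toy data: raw model scores and observed 2-year outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]

model = fit_isotonic(raw_scores, outcomes)
calibrated = [predict_isotonic(model, s) for s in raw_scores]

# Scoring on the fitting data is optimistic; the study reports Brier scores
# on internal validation and prospective cohorts instead.
brier_before = brier_score(raw_scores, outcomes)
brier_after = brier_score(calibrated, outcomes)
print("Brier before:", brier_before, "after:", brier_after)
print([risk_category(p) for p in calibrated])
```

A lower Brier score indicates probability estimates closer to the observed outcomes. In practice the calibration map would be fitted on held-out data and evaluated on a separate cohort, as the study does.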
Our machine learning model effectively predicted the 2-year risk of hypertension, making it particularly suitable for preventive health care management in high-risk populations residing in the desert regions of China. The model exhibited excellent predictive performance and has potential for clinical application. A web-based application built on the predictive model makes it more accessible for clinical and public health use and may help reduce the burden of hypertension through timely prevention strategies.