Estiko Reza Ishak, Widyantoro Bambang, Juzar Dafsah Arifa, Yusup Ramdhan Maulana, Rakhmat Iqbal Fauzi, Rijanto Estiko
Prof. Dr. Margono Soekarjo General Hospital, Purwokerto, Central Java, Indonesia.
Department of Cardiology and Vascular Medicine, Faculty of Medicine, Universitas Indonesia/National Cardiovascular Center Harapan Kita, Jakarta, Special Capital Region of Jakarta, Indonesia.
BMJ Open. 2025 Aug 27;15(8):e092364. doi: 10.1136/bmjopen-2024-092364.
This study aimed to screen for hypertension in a large Indonesian population using machine learning (ML) and 11 non-laboratory risk factors, and to validate the results both internally and externally.
Of the 1 782 365 participants aged 15 and above who were registered at the Integrated Counseling Post primary care centres across Indonesia from 2014 to 2017, 268 210 were included in the analysis after records with incomplete data and outliers were excluded. The dataset was split deterministically into a training set of 204 315 participants, evaluated with 10-fold internal cross-validation, and an external validation set of 63 895 participants.
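The split and the 10-fold partition described above can be illustrated with a minimal sketch. The abstract does not say which rule made the split deterministic, so the fixed-seed shuffle used here is an assumption, as are the function names `deterministic_split` and `k_fold_indices` and the toy cohort of 1000 IDs. Only the three participant counts come from the text.

```python
import random

def deterministic_split(ids, holdout_fraction, seed=0):
    """Shuffle with a fixed seed so the same split is reproduced on every run.
    (Assumed mechanism; the abstract does not describe the actual rule.)"""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # First n_holdout IDs form the external set, the rest the training set.
    return shuffled[n_holdout:], shuffled[:n_holdout]

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# Participant counts reported in the abstract.
N_TOTAL, N_TRAIN, N_EXTERNAL = 268_210, 204_315, 63_895
assert N_TRAIN + N_EXTERNAL == N_TOTAL

# Toy cohort of 1000 hypothetical IDs, held out in the same proportion.
train_ids, external_ids = deterministic_split(
    range(1000), holdout_fraction=N_EXTERNAL / N_TOTAL
)
folds = list(k_fold_indices(len(train_ids), k=10))
```

Each training-set participant appears in exactly one validation fold, while the external set is never touched during cross-validation.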
This retrospective cross-sectional study used three ML algorithms (random forest, gradient boosting and extreme gradient boosting (XGBoost)) and compared them against logistic regression as a benchmark for screening hypertension defined by the WHO and International Society of Hypertension criteria. The risk factors were ranked by importance. Screening performance, measured by sensitivity and area under the receiver operating characteristic curve (AUC), was evaluated with age, waist circumference (WC) and body mass index (BMI) entered either as continuous or as categorical variables.
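The two performance measures named above can be computed directly from labels and model scores. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function names, the toy labels and scores, and the two thresholds are all hypothetical.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def classify(scores, threshold):
    """Label a case positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical toy data: 1 = hypertensive, 0 = not hypertensive.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

auc = roc_auc(y_true, scores)
sens_strict = sensitivity(y_true, classify(scores, threshold=0.50))
sens_lenient = sensitivity(y_true, classify(scores, threshold=0.35))
```

Sensitivity depends on the chosen decision threshold, whereas AUC summarises ranking quality across all thresholds, so the same model can show a high sensitivity alongside a moderate AUC.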
External validation showed that the XGBoost model performed best in hypertension screening. With age, WC and BMI entered as continuous variables, the external validation yielded a sensitivity of 0.97 and an AUC of 0.75, which the authors interpret as excellent screening capability. In descending order of importance, the risk factors were family history of hypertension (FH-HTN), age, WC, BMI, occupation, education, sex, smoking, low physical activity, lack of fruit or vegetable intake and alcohol consumption.
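One generic way to rank risk factors by importance is permutation importance: shuffle one feature at a time and measure how much the model's accuracy drops. The abstract does not state which importance measure the authors used, so this is a swapped-in technique shown only as an illustration. The toy model, the three feature names and the simulated rows are all hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Return the accuracy drop caused by shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with feature j replaced by a shuffled value.
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return drops

# Hypothetical toy rows: (family_history, age, noise).
rng = random.Random(42)
rows = [(rng.randint(0, 1), rng.randint(15, 90), rng.random())
        for _ in range(500)]

def toy_model(row):
    """Stand-in classifier that ignores the noise feature entirely."""
    family_history, age, _noise = row
    return int(family_history == 1 or age >= 60)

# Labels are taken from the toy model itself, so baseline accuracy is 1.0.
labels = [toy_model(r) for r in rows]
names = ["family_history", "age", "noise"]
drops = permutation_importance(toy_model, rows, labels, n_features=3)
ranking = sorted(zip(names, drops), key=lambda item: item[1], reverse=True)
```

A feature the model never uses, such as `noise` here, produces no accuracy drop and therefore falls to the bottom of the ranking.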
Using 11 easy-to-collect non-laboratory risk factors, the ML model successfully screens for hypertension and outperforms the logistic regression benchmark. Entering age, WC and BMI as numerical variables yields better discrimination than entering them as categorical variables. FH-HTN and age are the two top-ranked risk factors for hypertension. This study is a useful academic exercise and shows the importance of ML in handling large datasets.