Sifat Isteaq Kabir, Kibria Md Kaderi
Department of Statistics, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh.
PLoS One. 2024 Dec 23;19(12):e0315865. doi: 10.1371/journal.pone.0315865. eCollection 2024.
Hypertension (HTN) prediction is critical for effective preventive healthcare strategies. This study investigates how well ensemble learning techniques improve the accuracy of HTN prediction models. Using a dataset of 612 participants from Ethiopia with 27 features potentially associated with HTN risk, we aimed to improve predictive performance over traditional single-model methods. A multi-faceted feature selection approach was employed, combining Boruta, Lasso regression, forward and backward selection, and random forest feature importance; the 13 features common to these methods were retained for prediction. Five machine learning (ML) models, namely logistic regression (LR), artificial neural network (ANN), random forest (RF), extreme gradient boosting (XGB), and light gradient boosting machine (LGBM), together with a stacking ensemble model, were trained on the selected features to predict HTN. Model performance on the testing set was evaluated using accuracy, precision, recall, F1-score, and area under the curve (AUC). Additionally, SHapley Additive exPlanations (SHAP) was used to examine the impact of individual features on the models' predictions and to identify the most important risk factors for HTN. The stacking ensemble model emerged as the most effective approach for predicting HTN risk, achieving an accuracy of 96.32%, precision of 95.48%, recall of 97.51%, F1-score of 96.48%, and an AUC of 0.971. SHAP analysis of the stacking model identified weight, drinking habits, history of hypertension, salt intake, age, diabetes, BMI, and fat intake as the most significant and interpretable risk factors for HTN. Our results demonstrate significant improvements in predictive accuracy and robustness, highlighting the potential of ensemble learning as a pivotal tool in healthcare analytics. This research contributes to ongoing efforts to optimize HTN prediction models, ultimately supporting early intervention and personalized healthcare management.
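As a reading aid, the reported evaluation metrics can be related to one another with a minimal Python sketch. It is not the authors' code. The function names (`classification_metrics`, `f1_from`) and the confusion-matrix counts are hypothetical; only the precision (95.48%) and recall (97.51%) values are taken from the abstract. The sketch shows the standard definitions and checks that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted HTN cases that are true cases
    recall = tp / (tp + fn)      # share of true HTN cases that are detected
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not from the study).
example = classification_metrics(tp=90, fp=10, fn=5, tn=95)

# Values reported in the abstract for the stacking ensemble.
reported_precision = 0.9548
reported_recall = 0.9751
f1_check = f1_from(reported_precision, reported_recall)
print(f"F1 implied by reported precision and recall: {f1_check:.4f}")
```

The implied value rounds to 96.48%, which matches the F1-score stated in the abstract.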