SiliconBlast Ltd., Calgary, AB, Canada.
Medtronic, Abu Dhabi, United Arab Emirates.
BMC Med Inform Decis Mak. 2024 Aug 23;24(1):234. doi: 10.1186/s12911-024-02634-9.
Responding to the rising global prevalence of noncommunicable diseases (NCDs) requires improvements in the management of high blood pressure. Therefore, this study aims to develop an explainable machine learning model for predicting high blood pressure, a key NCD risk factor, using data from the STEPwise approach to NCD risk factor surveillance (STEPS) surveys. Nationally representative samples of adults aged 18-69 years were acquired from 57 countries spanning six World Health Organization (WHO) regions. Data harmonization and processing were performed to standardize the selected predictors and synchronize features across countries, yielding 41 variables, including demographic, behavioural, physical, and biochemical factors. Five machine learning models (logistic regression, k-nearest neighbours, random forest, XGBoost, and a fully connected neural network) were trained and evaluated at global, regional, and country-specific levels using an 80/20 train-test split. Model performance was assessed using accuracy, precision, recall, and F1 score. Feature importance analysis identified age, weight, heart rate, waist circumference, and height as key predictors of blood pressure. Across the 57 countries studied, model performance varied considerably: accuracy ranged from 58.96% for some model-country combinations to 81.41% for others, underscoring the need for region- and country-specific adaptations in modelling approaches. The explainable model offers an opportunity for population-level screening and continuous risk assessment in resource-limited settings.
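The evaluation protocol described in the abstract (an 80/20 train-test split scored with accuracy, precision, recall, and F1) can be illustrated with a minimal sketch. The abstract does not state which software the authors used, so the helper names `train_test_split` and `binary_metrics` below are hypothetical, the labels are toy values rather than STEPS data, and the code assumes a binary outcome where 1 means high blood pressure.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper: test_fraction=0.2 gives the 80/20 split
    mentioned in the abstract; the seed is an arbitrary assumption.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy illustration only: 100 record indices and 8 made-up labels.
train_ids, test_ids = train_test_split(range(100))
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Reporting precision and recall alongside accuracy matters for a screening task like this one, because accuracy alone can look acceptable when one class dominates the sample while many true cases are still missed.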