Riveros Perez Efrain, Avella-Molano Bibiana
Augusta University Medical College of Georgia, Augusta, Georgia, USA
BMJ Open. 2025 Mar 22;15(3):e096595. doi: 10.1136/bmjopen-2024-096595.
This study aimed to compare the performance of five machine learning algorithms to predict diabetes mellitus based on lifestyle factors (diet and physical activity).
Retrospective cross-sectional predictive modelling study.
This study was conducted using publicly available data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey designed to assess the health and nutritional status of the US population.
We analysed data from 29 509 non-pregnant adults who participated in NHANES between 2007 and 2018.
The primary outcome was the prediction of type 2 diabetes mellitus (T2DM) from self-reported responses using machine learning models. The performance of five machine learning algorithms (logistic regression, support vector machine, random forest, XGBoost and CatBoost) was evaluated using accuracy, sensitivity, specificity, positive predictive value, negative predictive value and the area under the receiver operating characteristic curve (AUC). The secondary outcome measures were feature importance and model performance comparison.
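As an illustration of how the six evaluation measures named above are defined, the following is a minimal sketch in plain Python. It is not the authors' code: the function names (`confusion_counts`, `classification_metrics`, `auc`) and the toy labels and scores are hypothetical, and binary labels are assumed to be coded 1 for T2DM and 0 otherwise.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN and FN for binary labels (assumed coding: 1 = T2DM, 0 = no T2DM)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:  # truth == 1 and pred == 0 under the assumed 0/1 coding
            counts["fn"] += 1
    return counts


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, PPV and NPV from hard 0/1 predictions."""
    c = confusion_counts(y_true, y_pred)
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # true-positive rate
        "specificity": ratio(tn, tn + fp),  # true-negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. This pairwise form is O(n_pos * n_neg), so it suits
    small illustrations rather than a dataset of tens of thousands of rows.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from NHANES or from the study.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 cut-off

print(classification_metrics(toy_truth, toy_pred))
print(auc(toy_truth, toy_scores))
```

The threshold-based measures depend on the chosen cut-off (0.5 here, by assumption), whereas the AUC is computed from the continuous scores and does not.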
XGBoost exhibited the highest overall predictive performance (AUC 0.8168), followed by random forest and logistic regression (AUCs around 0.79). Logistic regression, XGBoost and random forest achieved similar accuracy, approximately 85%. While most models demonstrated high specificity (>97%), the support vector machine (SVM) had the highest sensitivity (58.57%), although with lower accuracy (62.44%). This trade-off shows that the SVM identifies more true-positive cases, at the cost of lower overall classification accuracy. The random forest model, despite its low sensitivity (7.15%), provided one of the most balanced performances in terms of specificity and interpretability.
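One way to see why a model can combine high accuracy with low sensitivity is the standard identity accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The sketch below applies it; the function name `accuracy_from_rates` and all input values are hypothetical illustrations, not figures from the study, whose outcome prevalence is not stated in this abstract.

```python
def accuracy_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy implied by sensitivity, specificity and outcome prevalence.

    Accuracy is a prevalence-weighted average of the two rates, so when the
    outcome is uncommon it is dominated by specificity.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Hypothetical inputs chosen only to illustrate the trade-off.
high_specificity_model = accuracy_from_rates(sensitivity=0.10, specificity=0.98, prevalence=0.15)
high_sensitivity_model = accuracy_from_rates(sensitivity=0.60, specificity=0.65, prevalence=0.15)
print(high_specificity_model, high_sensitivity_model)
```

Under these assumed inputs, the model that misses most positive cases still scores the higher accuracy, which is why sensitivity and specificity need to be read alongside accuracy when the outcome is uncommon.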
The results support the use of machine learning models, particularly XGBoost, for early identification of individuals at risk for T2DM. Despite their limited sensitivity, these models' high specificity and accuracy underscore their potential for non-invasive risk assessment. This study is innovative in using machine learning algorithms to predict T2DM based solely on non-invasive, easily accessible lifestyle and anthropometric variables, demonstrating the potential of data-driven models for early risk assessment without laboratory tests. Although most models showed low sensitivity, their high specificity makes them valuable for early screening in clinical and public health settings. There, they can be complemented with follow-up assessments or ensemble approaches that optimise the balance between sensitivity and specificity for improved risk stratification.