A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus.

Authors

Xiao Zhihui, Wang Mingfu, Zhao Yueliang, Wang Hui

Affiliations

College of Food Science and Technology, Shanghai Ocean University, Shanghai, China.

Shenzhen Key Laboratory of Food Nutrition and Health, College of Chemistry and Environmental Engineering, Shenzhen University, Shenzhen, China.

Publication information

Food Sci Nutr. 2025 Apr 30;13(5):e70234. doi: 10.1002/fsn3.70234. eCollection 2025 May.

Abstract

Diabetes is one of the leading causes of death and disability worldwide. Developing earlier and more accurate diagnostic methods is crucial for the clinical prevention and treatment of diabetes. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database from 2017 to 2020. After data preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was used to select features on the training set. Three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were then used to build predictive models with 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model performed best, with an area under the curve (AUC) of 0.958 (95% CI: 0.943-0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) were then applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model shows considerable promise for the diagnosis of diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment in diabetes.
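
The reported metrics are all functions of the same confusion matrix, and the gap between the F1 score (0.747) and the recall (0.897) is easier to read with the formulas written out. The sketch below is illustrative only: the confusion-matrix counts are hypothetical and are not taken from the study, and the names `ConfusionCounts`, `binary_metrics`, and `precision_from_f1` are introduced here for illustration. The last line rearranges F1 = 2PR / (P + R) to back out the precision implied by the rounded F1 and recall values; the abstract itself does not report precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (positive class = diabetes)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute recall, specificity, precision, F1, and accuracy from counts."""
    recall = c.tp / (c.tp + c.fn)        # sensitivity
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


def precision_from_f1(f1: float, recall: float) -> float:
    """Rearrange F1 = 2PR / (P + R) to solve for precision P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts for illustration only (NOT the study's confusion matrix).
example = ConfusionCounts(tp=90, fp=50, tn=850, fn=10)
metrics = binary_metrics(example)

# Precision implied by the rounded F1 and recall quoted in the abstract.
implied_precision = precision_from_f1(f1=0.747, recall=0.897)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"implied precision: {implied_precision:.3f}")
```

With the rounded values from the abstract, the implied precision comes out near 0.64, which would be consistent with a test set in which participants without diabetes outnumber those with it. This is arithmetic on rounded figures, not a reported result.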

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38f0/12041655/20b8acf9dec0/FSN3-13-e70234-g006.jpg
