Shojaee-Mend Hassan, Velayati Farnia, Tayefi Batool, Babaee Ebrahim
Infectious Diseases Research Center, Gonabad University of Medical Sciences, Gonabad, Iran.
Telemedicine Research Center, National Research Institute of Tuberculosis and Lung Diseases (NRITLD), Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Healthc Inform Res. 2024 Jan;30(1):73-82. doi: 10.4258/hir.2024.30.1.73. Epub 2024 Jan 31.
This study aimed to develop a model to predict fasting blood glucose status using machine learning and data mining, since the early diagnosis and treatment of diabetes can improve outcomes and quality of life.
This cross-sectional study analyzed data from 3,376 adults over 30 years old who participated in a diabetes screening program at 16 comprehensive health service centers in Tehran, Iran. The dataset was balanced using random sampling and the synthetic minority over-sampling technique (SMOTE), and was split into a training set (80%) and a test set (20%). Shapley values were calculated to select the most important features. Noise analysis was performed by adding Gaussian noise to the numerical features to evaluate the robustness of feature importance. Five machine learning algorithms (CatBoost, random forest, XGBoost, logistic regression, and an artificial neural network) were used to model the dataset. Accuracy, sensitivity, specificity, precision, the F1-score, and the area under the curve were used to evaluate the models.
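Two of the preprocessing ideas named above, SMOTE-style oversampling and Gaussian noise injection, can be illustrated with a minimal sketch. This is not the study's code: the function names `smote_sketch` and `add_gaussian_noise`, the toy rows, and the parameter values are all assumptions made for illustration, and a real analysis would normally rely on an established library implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    Each synthetic row is a random point on the line segment between a
    minority-class row and one of its k nearest minority-class neighbors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

def add_gaussian_noise(rows, sigma, seed=0):
    """Add zero-mean Gaussian noise to every numerical value (hypothetical helper)."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]

# Made-up minority-class rows: (age in years, body mass index).
minority = [[45.0, 27.1], [52.0, 30.4], [61.0, 29.0], [58.0, 33.2]]
new_rows = smote_sketch(minority, n_new=6, k=2)
noisy_rows = add_gaussian_noise(minority, sigma=1.0)
```

Because each synthetic row is a convex combination of two real rows, its values always stay within the range already observed in the minority class. For a robustness check, feature importance would be recomputed on the noisy copy and compared with the ranking obtained from the original data.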
Age, waist-to-hip ratio, body mass index, and systolic blood pressure were the most important factors for predicting fasting blood glucose status. Although the models achieved similar predictive performance, the CatBoost model performed slightly better overall, with an area under the curve (AUC) of 0.737.
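For readers less familiar with these evaluation measures, the following sketch computes the standard definitions of accuracy, sensitivity, specificity, precision, and the F1-score from binary predictions, and the AUC as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The labels and scores are invented for illustration and are unrelated to the study data; the function names are assumptions.

```python
def threshold_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = positive, 0 = negative).

    Assumes both classes are present and at least one positive prediction,
    so none of the denominators is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # positive predictive value
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: true labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.8, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption

metrics = threshold_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported value of 0.737.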
A gradient-boosted decision tree model identified the most important risk factors related to diabetes. In descending order of importance, these were age, waist-to-hip ratio, body mass index, and systolic blood pressure. This model can support planning for diabetes management and prevention.