Farnoosh Rahman, Abnoosian Karlo, Isewid Rasha Abbas
The School of Mathematics and Computer Science, Statistics, Iran University of Science and Technology, Tehran, Iran.
J Med Signals Sens. 2025 Apr 19;15:11. doi: 10.4103/jmss.jmss_29_24. eCollection 2025.
The global increase in diabetes prevalence necessitates advanced diagnostic methods. Machine learning has shown promise in disease diagnosis, including diabetes.
We used a dataset collected from the Medical City Hospital laboratory and the Specialized Center for Endocrinology and Diabetes at Al-Kindy Teaching Hospital in Iraq. The dataset comprises 1000 physical examination samples from male and female patients, categorized into three classes: diabetic (Y), nondiabetic (N), and predicted diabetic (P). It contains twelve attributes and includes outliers. In medical studies, outliers can reflect unusual disease attributes rather than measurement error, so a specialist physician should be consulted when statistical methods are used to identify and handle them. The main contribution of this study is two hybrid models for diabetes diagnosis, one for each of two scenarios. In Scenario 1 (outliers present), Hybrid Model 1 combines the K-medoids clustering algorithm with a Gaussian naive Bayes (GNB) classifier based on kernel density estimation (KDE) to handle outliers. In Scenario 2 (outliers removed), Hybrid Model 2 combines the K-means clustering algorithm with a GNB classifier based on KDE with a suitable bandwidth. We applied principal component analysis to reduce dimensionality and evaluated the models using fivefold cross-validation.
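To make the classifier component more concrete, the following is a minimal sketch of a naive Bayes classifier whose per-feature class-conditional densities are estimated with a Gaussian kernel. It is an illustration under stated assumptions, not the authors' implementation: the class name `KDENaiveBayes`, the bandwidth of 0.5, and the toy data are hypothetical, and the clustering (K-medoids or K-means), principal component analysis, and cross-validation steps are omitted.

```python
import math
from collections import defaultdict


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


class KDENaiveBayes:
    """Naive Bayes with one-dimensional Gaussian KDE per class and feature.

    Hypothetical helper for illustration; the bandwidth default is an
    assumption, not a value reported in the study.
    """

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        # Group the training rows by class label.
        self.samples_ = defaultdict(list)
        for row, label in zip(X, y):
            self.samples_[label].append(row)
        n = len(y)
        # Log prior of each class from its relative frequency.
        self.log_prior_ = {
            label: math.log(len(rows) / n)
            for label, rows in self.samples_.items()
        }
        return self

    def _log_density(self, value, column):
        # KDE: average of kernels centred on the training values.
        h = self.bandwidth
        density = sum(gaussian_kernel((value - v) / h) for v in column)
        density /= len(column) * h
        # Floor the density so the logarithm is always defined.
        return math.log(max(density, 1e-300))

    def predict_one(self, x):
        scores = {}
        for label, rows in self.samples_.items():
            score = self.log_prior_[label]
            # Naive independence assumption: sum per-feature log densities.
            for j, value in enumerate(x):
                score += self._log_density(value, [r[j] for r in rows])
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, X):
        return [self.predict_one(x) for x in X]


# Synthetic toy data (not from the study): two well-separated classes.
X_train = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
           [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
y_train = ["N", "N", "N", "Y", "Y", "Y"]

model = KDENaiveBayes(bandwidth=0.5).fit(X_train, y_train)
predictions = model.predict([[1.0, 1.0], [3.0, 3.0]])
print(predictions)
```

The bandwidth controls how strongly each training point is smoothed: a small value follows the training data closely, while a large value flattens the estimated density, which is presumably why the abstract stresses choosing a suitable bandwidth.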
All experiments were conducted under identical settings. In both scenarios, with outliers retained and with outliers removed, the proposed hybrid models outperformed the other machine-learning models evaluated in this study for diabetes prediction: support vector machines (with radial basis function, polynomial, linear, and sigmoid kernels), decision trees (J48), and GNB classifiers. The average accuracy was 0.9743 for Hybrid Model 1 in Scenario 1 and 0.9867 for Hybrid Model 2 in Scenario 2. We also evaluated precision, sensitivity, and F1-score as performance metrics.
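The metrics named above can be computed from true and predicted labels as in the sketch below. The abstract does not say how per-class values were combined across the three classes, so macro-averaging is an assumption here; the function names and the toy labels are hypothetical and are not results from the study.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_report(y_true, y_pred):
    """Accuracy plus macro-averaged precision, sensitivity, and F1-score."""
    labels = sorted(set(y_true))
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precisions, sensitivities, f1_scores = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precision = safe_div(tp, tp + fp)
        sensitivity = safe_div(tp, tp + fn)  # also called recall
        f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
        precisions.append(precision)
        sensitivities.append(sensitivity)
        f1_scores.append(f1)
    k = len(labels)
    return {
        "accuracy": safe_div(correct, len(y_true)),
        "precision": sum(precisions) / k,
        "sensitivity": sum(sensitivities) / k,
        "f1": sum(f1_scores) / k,
    }


# Synthetic labels using the dataset's class codes (Y, N, P).
y_true = ["Y", "Y", "N", "N", "P", "P"]
y_pred = ["Y", "Y", "N", "P", "P", "P"]
report = macro_report(y_true, y_pred)
print(report)
```

Under fivefold cross-validation, a report like this would be produced for each held-out fold and the five values of each metric averaged.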
This study presents two hybrid models for diabetes diagnosis, demonstrating high accuracy in distinguishing between diabetic and nondiabetic patients and effectively handling outliers. The findings highlight the potential of machine-learning techniques for improving the early diagnosis and treatment of diabetes.