Olorunfemi Blessing Oluwatobi, Ogunde Adewale Opeoluwa, Almogren Ahmad, Adeniyi Abidemi Emmanuel, Ajagbe Sunday Adeola, Bharany Salil, Altameem Ayman, Rehman Ateeq Ur, Mehmood Asif, Hamam Habib
Department of Computer Science, Faculty of Natural Sciences, Redeemer's University, Ede, Osun State, Nigeria.
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, 11633, Saudi Arabia.
Sci Rep. 2025 Jan 25;15(1):3235. doi: 10.1038/s41598-025-87767-1.
Diabetes is a growing health concern in developing countries, causing considerable mortality. Machine learning (ML) approaches have been widely used to improve early detection and treatment, but several studies have reported low classification accuracy due to overfitting, underfitting, and data noise. This research combines parallel and sequential ensemble ML approaches with feature selection techniques to improve classification accuracy. The Pima Indians Diabetes dataset from the UCI ML Repository was used. Preprocessing consisted of replacing missing values with column means and selecting highly correlated features using forward and backward selection. The dataset was split into training (70%) and testing (30%) sets. Classification was implemented in Python in Jupyter Notebook, in two design phases. The first phase used J48, Classification and Regression Tree (CART), and Decision Stump (DS) to create a random forest model. The second phase employed the same algorithms alongside the sequential ensemble methods XGBoost, AdaBoostM1, and Gradient Boosting, using an average voting algorithm for binary classification. In the evaluation, XGBoost, AdaBoostM1, and Gradient Boosting each achieved a classification accuracy of 100%, with F1 score, MCC, precision, recall, AUC-ROC, and AUC-PR all equal to 1.00, indicating reliable predictions of diabetes presence. Researchers and practitioners can use the predictive model developed in this work to make quick predictions of diabetes mellitus, which could save many lives.
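The pipeline described in the abstract (mean imputation, forward feature selection, a 70/30 split, and average voting over boosting models) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. It uses scikit-learn only, so XGBoost is left out of the ensemble. It runs on a synthetic stand-in with the same shape as the Pima data (768 rows, 8 features) rather than the real dataset. The names build_pipeline and run_demo, the choice of 5 selected features, the stratified split, and the 5% missing-value rate are all hypothetical. Scores obtained on the synthetic data say nothing about the results reported in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(n_features_to_select=5, random_state=0):
    """Mean imputation -> forward selection -> soft (average) voting."""
    # Soft voting averages the predicted class probabilities of the members.
    # AdaBoost's default base learner is a depth-1 tree (a decision stump).
    voter = VotingClassifier(
        estimators=[
            ("ada", AdaBoostClassifier(n_estimators=50, random_state=random_state)),
            ("gb", GradientBoostingClassifier(random_state=random_state)),
        ],
        voting="soft",
    )
    selector = SequentialFeatureSelector(
        DecisionTreeClassifier(max_depth=3, random_state=random_state),
        n_features_to_select=n_features_to_select,  # assumed value
        direction="forward",  # "backward" gives backward elimination
        cv=3,
    )
    return Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # column-mean imputation
        ("select", selector),
        ("vote", voter),
    ])


def run_demo(random_state=0):
    # Synthetic stand-in: 768 samples, 8 features, binary target.
    X, y = make_classification(
        n_samples=768, n_features=8, n_informative=4, random_state=random_state
    )
    # Knock out about 5% of the entries to mimic missing values (assumed rate).
    rng = np.random.default_rng(random_state)
    X[rng.random(X.shape) < 0.05] = np.nan

    # 70% training, 30% testing; stratification is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=random_state
    )

    pipe = build_pipeline(random_state=random_state)
    pipe.fit(X_train, y_train)  # imputer and selector see training data only
    y_pred = pipe.predict(X_test)

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),
    }
    return metrics, pipe, (len(X_train), len(X_test))


if __name__ == "__main__":
    metrics, _, sizes = run_demo()
    print("train/test sizes:", sizes)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Wrapping the imputer and selector in a Pipeline means both are fitted on the training portion only, which avoids leaking test-set statistics into preprocessing. The abstract does not state whether the original study imputed before or after splitting.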