Howlader Koushik Chandra, Satu Md Shahriare, Awal Md Abdul, Islam Md Rabiul, Islam Sheikh Mohammed Shariful, Quinn Julian M W, Moni Mohammad Ali
Department of CSTE, Noakhali Science and Technology University, Noakhali, Bangladesh.
Department of MIS, Noakhali Science and Technology University, Noakhali, Bangladesh.
Health Inf Sci Syst. 2022 Feb 9;10(1):2. doi: 10.1007/s13755-021-00168-2. eCollection 2022 Dec.
Type 2 Diabetes (T2D) is a chronic disease characterized by abnormally high blood glucose levels due to insulin resistance and reduced pancreatic insulin production. The challenge of this work is to identify T2D-associated features that can distinguish T2D sub-types for prognosis and treatment purposes. We therefore employed machine learning (ML) techniques to categorize T2D patients using data from the Pima Indian Diabetes Dataset from the Kaggle ML repository. After data preprocessing, several feature selection techniques were used to extract feature subsets, and a range of classification techniques were applied to these subsets. We then compared the classification results to identify the best classifiers in terms of accuracy, kappa statistic, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and logarithmic loss (logloss). To evaluate the performance of the different classifiers, we examined summary statistics of their outcomes over a resampling distribution. Generalized Boosted Regression modeling showed the highest accuracy (90.91%), with a kappa statistic of 78.77% and a specificity of 85.19%. In addition, Sparse Distance Weighted Discrimination, the Generalized Additive Model using LOESS, and Boosted Generalized Additive Models gave the maximum sensitivity (100%), the highest AUROC (95.26%), and the lowest logarithmic loss (30.98%), respectively. Notably, the Generalized Additive Model using LOESS was the top-ranked algorithm according to non-parametric Friedman testing. Of the features identified by these machine learning models, glucose level, body mass index, diabetes pedigree function, and age were most consistently identified as the strongest outcome predictors. These results indicate the utility of ML methods in constructing improved prediction models for T2D and identify outcome predictors for this Pima Indian population.
The online version contains supplementary material available at 10.1007/s13755-021-00168-2.
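The evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`binary_metrics`, `log_loss`, `auroc`) and the toy labels are assumptions made for illustration, and the formulas are the standard textbook definitions for a binary outcome (1 = diabetic, 0 = non-diabetic).

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, Cohen's kappa, sensitivity, specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def auroc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up labels, not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this convention a logloss reported as 30.98% corresponds to a mean negative log-likelihood of about 0.31; the pairwise AUROC above is quadratic in the sample size, which is acceptable for a dataset of this scale.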