Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.
The JiVitA Project of Johns Hopkins University, Gaibandha, Bangladesh.
J Med Syst. 2018 Apr 10;42(5):92. doi: 10.1007/s10916-018-0940-7.
Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world's population was diabetic in 2017, and this is projected to reach nearly 10% by 2045. The major challenge is that machine learning-based classifiers applied to such data sets for risk stratification show lower performance. Our objective is therefore to develop an optimized and robust machine learning (ML) system under the hypothesis that replacing missing values and outliers with computed medians will yield higher risk stratification accuracy. The ML-based risk stratification system is designed, optimized, and evaluated as follows: features are extracted and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, AdaBoost, logistic regression, decision tree, and random forest). The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values with group medians and the outliers with median values, and then combining random forest feature selection with random forest classification, yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability.
The RF-based model showed the best performance when outliers were replaced by median values.
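The preprocessing idea described above (group medians for missing values, medians for outliers) can be illustrated with a minimal sketch. The abstract does not say how outliers are detected, so the 1.5 × IQR fence used here is an assumption, as are the function names, the toy glucose values, and the use of `None` to mark missing entries. It is not the authors' implementation.

```python
from statistics import median, quantiles


def impute_group_median(values, labels):
    """Replace None entries with the median of the observed values
    that share the same class label (the 'group median')."""
    observed = {}
    for v, y in zip(values, labels):
        if v is not None:
            observed.setdefault(y, []).append(v)
    group_median = {y: median(vs) for y, vs in observed.items()}
    return [group_median[y] if v is None else v for v, y in zip(values, labels)]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the overall median.
    The IQR fence is an assumed outlier rule, not taken from the abstract."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if (v < low or v > high) else v for v in values]


# Toy feature column (hypothetical glucose readings) with class labels:
# 0 = control, 1 = diabetic. None marks a missing value.
glucose = [85, None, 90, 150, None, 160, 155, 600]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

imputed = impute_group_median(glucose, labels)
cleaned = replace_outliers_with_median(imputed)
print(imputed)
print(cleaned)
```

In this toy column the two missing readings take the median of their own class, and the single extreme reading of 600 is pulled back to the overall median, while all other values are left unchanged. Feature selection and classification would then be run on the cleaned columns.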