Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.
Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh.
Comput Methods Programs Biomed. 2017 Dec;152:23-34. doi: 10.1016/j.cmpb.2017.09.004. Epub 2017 Sep 8.
Diabetes is a silent killer. The main cause of this disease is the presence of excessive amounts of metabolites such as glucose. In 2014, about 387 million people worldwide had diabetes. The financial burden of the disease has been calculated at about $13,700 per year. According to the World Health Organization (WHO), these figures will more than double by 2030. This cost could be reduced dramatically if diabetes could be predicted statistically from a set of covariates. Although several classification techniques are available, classifying diabetes remains very difficult. The main objectives of this paper are: (i) to apply Gaussian process classification (GPC) to diabetes data, (ii) to compare it with other classifiers for diabetes data, (iii) to analyze the data using a cross-validation approach, (iv) to interpret the results of the analysis, and (v) to benchmark our method against others.
Several classification techniques are used to classify diabetes, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and naive Bayes (NB). However, most medical data show non-normality, non-linearity, and an inherent correlation structure. In this paper we therefore adopt a Gaussian process (GP)-based classification technique with three kernels: linear, polynomial, and radial basis. We also compare the performance of the GP-based technique with that of existing techniques such as LDA, QDA, and NB. Performance is evaluated using accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and receiver operating characteristic (ROC) curves.
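The abstract names the three kernels but gives neither their functional forms nor their hyperparameters. As a rough illustration only, the standard textbook definitions can be sketched in plain Python as follows. The function names, the polynomial `degree` and `offset`, and the RBF `length_scale` are all assumptions made for this sketch, not values taken from the paper.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: k(x, y) = x . y (inner product of the covariates)."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """Polynomial kernel: k(x, y) = (x . y + offset) ** degree.
    degree and offset are assumed defaults, not the paper's settings."""
    return (linear_kernel(x, y) + offset) ** degree

def rbf_kernel(x, y, length_scale=1.0):
    """Radial basis kernel: k(x, y) = exp(-||x - y||^2 / (2 * length_scale^2)).
    length_scale is an assumed default, not the paper's setting."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def gram_matrix(kernel, rows):
    """Covariance (Gram) matrix K[i][j] = kernel(rows[i], rows[j])."""
    return [[kernel(r, s) for s in rows] for r in rows]

# Two made-up covariate vectors, used only to exercise the functions.
toy_rows = [[1.0, 2.0], [3.0, 4.0]]
for name, k in [("linear", linear_kernel),
                ("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel)]:
    print(name, gram_matrix(k, toy_rows))
```

In a GP classifier, a Gram matrix of this kind serves as the prior covariance over the latent function values, so the choice of kernel determines how similarity between patients' covariates shapes the predicted class probabilities.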
The Pima Indian diabetes dataset is used in this study. It comprises 768 patients, of whom 268 are diabetic and 500 are controls. The GP-based model achieves ACC 81.97%, SE 91.79%, SP 63.33%, PPV 84.91%, and NPV 62.50%, which are higher than those of the other methods.
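For readers unfamiliar with the reported measures, the sketch below shows how ACC, SE, SP, PPV, and NPV follow from the four cells of a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the paper's confusion matrix, and the abstract does not state which class was treated as positive.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly
    tn/fp: negatives predicted correctly/incorrectly
    """
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # overall accuracy
        "SE": tp / (tp + fn),                    # sensitivity (recall)
        "SP": tn / (tn + fp),                    # specificity
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
    }

# Hypothetical counts for a 100-case test fold (not from the paper).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under cross-validation, these measures would typically be computed on each held-out fold and then averaged, although the abstract does not specify the exact protocol.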