Tapak Lily, Mahjub Hossein, Hamidi Omid, Poorolajal Jalal
Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.
Healthc Inform Res. 2013 Sep;19(3):177-85. doi: 10.4258/hir.2013.19.3.177. Epub 2013 Sep 30.
Diabetes is one of the most common non-communicable diseases in developing countries. Early screening and diagnosis play an important role in effective prevention strategies. This study compared two traditional classification methods (logistic regression and Fisher linear discriminant analysis) and four machine-learning classifiers (neural networks, support vector machines, fuzzy c-means, and random forests) for classifying persons with and without diabetes.
The data set comprised 6,500 subjects from the Iranian national non-communicable disease risk factor surveillance, a cross-sectional survey. The sample was drawn by cluster sampling of the Iranian population, conducted in 2005-2009 to assess the prevalence of major non-communicable disease risk factors. Ten risk factors commonly associated with diabetes were selected, and the six classifiers were compared in terms of sensitivity, specificity, total accuracy, and area under the receiver operating characteristic (ROC) curve.
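The four performance criteria named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the 0.5 decision threshold, and the eight-subject toy labels and scores are all assumptions made for the example, and the area under the ROC curve is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case).

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 1 = diabetic, 0 = non-diabetic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true cases that the classifier detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-cases that the classifier correctly rejects."""
    return tn / (tn + fp)


def total_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all subjects classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 3 cases, 5 non-cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = total_accuracy(tp, fp, tn, fn)
auc = roc_auc(y_true, scores)
```

Sensitivity, specificity, and total accuracy depend on the chosen threshold, whereas the area under the ROC curve summarizes the ranking quality of the scores across all thresholds.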
Support vector machines showed the highest total accuracy (0.986) and area under the ROC curve (0.979), together with high specificity (1.000) and sensitivity (0.820). Each of the other five methods achieved a total accuracy above 85%, but their sensitivity values were very low (less than 0.350).
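A classifier can combine high total accuracy with low sensitivity when the condition is uncommon in the sample, because total accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below illustrates this identity; the prevalence, sensitivity, and specificity values are hypothetical and are not taken from the study.

```python
def accuracy_from_rates(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Total accuracy = prevalence * sensitivity + (1 - prevalence) * specificity."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Hypothetical values chosen only for illustration (NOT study results).
assumed_prevalence = 0.10   # 10% of subjects have the condition
assumed_sensitivity = 0.30  # only 30% of cases are detected
assumed_specificity = 0.98  # 98% of non-cases are correctly rejected

example_accuracy = accuracy_from_rates(
    assumed_prevalence, assumed_sensitivity, assumed_specificity
)
```

Under these assumed values the total accuracy is about 0.91 even though most cases are missed, which is why sensitivity is reported alongside accuracy.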
The results of this study indicate that, in terms of sensitivity, specificity, and overall classification accuracy, the support vector machine model ranks first among all the classifiers tested in the prediction of diabetes. Therefore, this approach is a promising classifier for predicting diabetes, and it should be further investigated for the prediction of other diseases.