Zou Quan, Qu Kaiyang, Luo Yamei, Yin Dehui, Ju Ying, Tang Hua
School of Computer Science and Technology, Tianjin University, Tianjin, China.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
Front Genet. 2018 Nov 6;9:515. doi: 10.3389/fgene.2018.00515. eCollection 2018.
Diabetes mellitus is a chronic disease characterized by hyperglycemia, and it can cause many complications. Given the rising morbidity of recent years, the number of people with diabetes worldwide is projected to reach 642 million by 2040, meaning that one in ten adults will have the disease. This alarming figure demands attention. Machine learning has developed rapidly and is now applied to many aspects of medical health. In this study, we used decision tree, random forest, and neural network models to predict diabetes mellitus. The dataset consists of hospital physical examination data from Luzhou, China, and contains 14 attributes. Five-fold cross-validation was used to evaluate the models. To verify the general applicability of the methods, we selected the better-performing ones for independent test experiments. We randomly selected 68,994 records each from healthy people and diabetic patients as the training set. Because the data were imbalanced, we repeated the random extraction five times and report the average of these five experiments. We used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that random forest reached the highest accuracy (ACC = 0.8084) when all attributes were used.
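The evaluation protocol described in the abstract (balanced random extraction repeated five times, five-fold cross-validation, three classifiers, optional PCA) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, replaces the Luzhou data (not available here) with a synthetic 14-attribute dataset, and uses hypothetical helper names (`balanced_sample`, `evaluate`) and arbitrary hyperparameters. mRMR is omitted because scikit-learn does not provide it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def balanced_sample(X, y, rng):
    """Undersample so that both classes contribute the same number of rows."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    return X[keep], y[keep]


def evaluate(X, y, n_draws=5, n_components=None, seed=0):
    """Mean five-fold CV accuracy per model, averaged over n_draws balanced draws."""
    rng = np.random.default_rng(seed)
    # Hyperparameters below are illustrative assumptions, not from the paper.
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                                        random_state=0),
    }
    scores = {name: [] for name in models}
    for draw in range(n_draws):
        Xb, yb = balanced_sample(X, y, rng)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=draw)
        for name, clf in models.items():
            steps = [StandardScaler()]
            if n_components is not None:
                # Optional dimensionality reduction, fitted inside each fold.
                steps.append(PCA(n_components=n_components))
            steps.append(clf)
            pipe = make_pipeline(*steps)
            acc = cross_val_score(pipe, Xb, yb, cv=cv, scoring="accuracy")
            scores[name].append(acc.mean())
    return {name: float(np.mean(v)) for name, v in scores.items()}


# Synthetic, imbalanced stand-in for the 14-attribute examination data.
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           weights=[0.7, 0.3], random_state=0)
results = evaluate(X, y)                      # all 14 attributes
results_pca = evaluate(X, y, n_components=5)  # after PCA to 5 components
print(results)
print(results_pca)
```

Placing the scaler and PCA inside the pipeline means they are refitted on the training folds only, so no information from the held-out fold leaks into the transformation. Accuracies from this synthetic data say nothing about the values reported in the study.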