Olivera André Rodrigues, Roesler Valter, Iochpe Cirano, Schmidt Maria Inês, Vigo Álvaro, Barreto Sandhi Maria, Duncan Bruce Bartholow
MSc. IT Analyst, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
PhD. Professor, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
Sao Paulo Med J. 2017 May-Jun;135(3):234-246. doi: 10.1590/1516-3180.2016.0309010217.
CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. This study aimed to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil), and to compare the performance of different machine-learning algorithms in this task.
DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.
METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection by forward selection, a wrapper strategy in which each candidate subset of variables was evaluated with four different machine-learning algorithms under tenfold cross-validation (repeated three times); (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.
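The variable-selection step can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the names `repeated_kfold`, `cv_score`, `forward_selection` and the callback `fit_and_score` are hypothetical, and the `min_gain` stopping rule is an assumption, since the abstract does not state when the forward search stopped. The callback is expected to train any one of the algorithms on the training indices using the given variables and return a score (for example, the area under the curve) on the test indices.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                        # fresh random partition per repeat
        folds = [idx[i::k] for i in range(k)]   # k folds of near-equal size
        for i, test in enumerate(folds):
            train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
            yield train, test


def cv_score(features, n_samples, fit_and_score, k=10, repeats=3):
    """Mean score of one variable subset over all repeated-CV splits."""
    return mean(fit_and_score(features, train, test)
                for train, test in repeated_kfold(n_samples, k, repeats))


def forward_selection(candidates, n_samples, fit_and_score, min_gain=1e-4):
    """Greedy wrapper: repeatedly add the variable that most improves the CV score.

    Stops when no remaining variable improves the score by at least
    `min_gain` (an assumed stopping rule, not taken from the study).
    """
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(cv_score(selected + [c], n_samples, fit_and_score), c)
                  for c in remaining]
        score, choice = max(scored)
        if score - best < min_gain:
            break
        selected.append(choice)
        remaining.remove(choice)
        best = score
    return selected, best
```

Because the search only sees the model through `fit_and_score`, the same loop could be run once per algorithm, which is what makes this a wrapper strategy rather than a filter.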
RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of 75.24% and 74.98%, respectively, in the error estimation step, and 74.17% and 74.41% in the generalization testing step.
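For readers unfamiliar with the metric, the area under the ROC curve can be read as the probability that a randomly chosen case with the condition receives a higher predicted score than a randomly chosen case without it. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy data are illustrative assumptions, not values from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 2 positives, 3 negatives; 5 of the 6 pairs are ordered correctly.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 5/6, roughly 0.83
```

On this reading, an area of about 75% means that roughly three out of four such pairs would be ordered correctly by the model.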
CONCLUSION: Most of the predictive models produced similar results and demonstrated the feasibility of identifying the individuals with the highest probability of having undiagnosed diabetes using easily obtained clinical data.