Park Ji Hwan, Cho Han Eol, Kim Jong Hun, Wall Melanie M, Stern Yaakov, Lim Hyunsun, Yoo Shinjae, Kim Hyoung Seop, Cha Jiook
1 Computational Science Initiative, Brookhaven National Laboratory, Upton, NY 11973 USA.
2 Department of Rehabilitation Medicine, Gangnam Severance Hospital and Rehabilitation Institute of Neuromuscular Disease, Yonsei University College of Medicine, Seoul, Korea.
NPJ Digit Med. 2020 Mar 26;3:46. doi: 10.1038/s41746-020-0256-0. eCollection 2020.
Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models based on individuals' history of health and healthcare, going beyond existing risk prediction models. We tested whether machine learning models can predict the future incidence of Alzheimer's disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data for elders above 65 years of age (n = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: "definite AD," with diagnostic codes and dementia medication (n = 614), and "probable AD," with diagnosis only (n = 2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in 1, 2, 3, and 4 subsequent years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the "definite AD" and "probable AD" outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year prediction, 0.677 and 0.644; and in 4-year prediction, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data in AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or earlier detection in clinical settings.
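To make the evaluation setup concrete, the following is a minimal sketch of how an AUC might be estimated on balanced bootstrap samples when cases are rare. It is not the authors' pipeline: the abstract does not specify the resampling scheme, so the equal-sized draw of cases and controls with replacement, the function names (`auc`, `balanced_bootstrap`, `bootstrap_auc`), the number of rounds, and the synthetic risk scores are all assumptions made for illustration. In practice the scores would come from a fitted classifier such as a random forest, support vector machine, or logistic regression.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.

    Tied scores share the average rank of their tie group.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def balanced_bootstrap(labels, rng):
    """Assumed scheme: draw equally many cases and controls, with replacement."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)


def bootstrap_auc(labels, scores, n_rounds=200, seed=0):
    """Mean AUC over repeated balanced bootstrap samples."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_rounds):
        idx = balanced_bootstrap(labels, rng)
        aucs.append(auc([labels[i] for i in idx], [scores[i] for i in idx]))
    return sum(aucs) / len(aucs)


# Synthetic toy cohort (hypothetical data): about 5% cases, with risk scores
# that are on average slightly higher for cases than for controls.
data_rng = random.Random(42)
toy_labels = [1 if data_rng.random() < 0.05 else 0 for _ in range(2000)]
toy_scores = [data_rng.gauss(0.8 if y == 1 else 0.0, 1.0) for y in toy_labels]
print("AUC on full (unbalanced) sample:", round(auc(toy_labels, toy_scores), 3))
print("Mean AUC over balanced bootstraps:", round(bootstrap_auc(toy_labels, toy_scores), 3))
```

Because the AUC depends only on how cases rank relative to controls, the balanced and unbalanced estimates in a sketch like this tend to be close, which is consistent with the abstract's remark that results were similar on the entire unbalanced sample.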