Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Research Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina.
Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina.
Comput Methods Programs Biomed. 2017 Dec;152:53-70. doi: 10.1016/j.cmpb.2017.09.009. Epub 2017 Sep 14.
Recent progress toward precision medicine has encouraged the use of electronic health records (EHRs) as a source of the large amounts of data required to study the effects of treatments or risk factors in more specific subpopulations. Phenotyping algorithms automatically classify patients according to their particular electronic phenotype, thus facilitating the setup of retrospective cohorts. Our objective is to compare the performance of different classification strategies (using only standardized problems, rule-based algorithms, statistical learning algorithms (six learners), and stacked generalization (five versions)) for categorizing patients according to their diabetic status (diabetic, non-diabetic, or inconclusive; diabetes of any type) using information extracted from EHRs.
Patient information was extracted from the EHR at Hospital Italiano de Buenos Aires, Buenos Aires, Argentina. For the derivation and validation datasets, two probabilistic samples of patients from different years (2005: n = 1663; 2015: n = 800) were extracted. The only inclusion criterion was age (≥40 and <80 years). Four researchers manually reviewed all records and classified patients according to their diabetic status (diabetic: diabetes registered as a health problem or fulfilling the ADA criteria; non-diabetic: not fulfilling the ADA criteria and having at least one fasting glycemia below 126 mg/dL; inconclusive: no data regarding their diabetic status or only one abnormal value). The best-performing algorithms within each strategy were tested on the validation set.
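The three-way labeling rule described above can be sketched in code. This is a minimal illustration, not the study's algorithm: the function name `classify_patient`, its inputs, and the simplification that two or more fasting glycemia values at or above 126 mg/dL stand in for "fulfilling the ADA criteria" are all assumptions made here (the full ADA criteria include other tests that the abstract does not detail). The sketch also assumes that a single abnormal value yields "inconclusive" even when normal values are present, which the abstract leaves open.

```python
from typing import List

FASTING_THRESHOLD_MG_DL = 126.0  # fasting glycemia cut-off named in the text


def classify_patient(has_diabetes_problem: bool,
                     fasting_glycemia: List[float]) -> str:
    """Return 'diabetic', 'non-diabetic', or 'inconclusive' (hypothetical helper)."""
    abnormal = [g for g in fasting_glycemia if g >= FASTING_THRESHOLD_MG_DL]
    normal = [g for g in fasting_glycemia if g < FASTING_THRESHOLD_MG_DL]

    # Assumption: two or more abnormal fasting values approximate
    # "fulfilling the ADA criteria"; the real criteria are broader.
    meets_ada_proxy = len(abnormal) >= 2

    if has_diabetes_problem or meets_ada_proxy:
        return "diabetic"
    if len(abnormal) == 1:
        # Assumption: one abnormal value is inconclusive regardless of
        # how many normal values accompany it.
        return "inconclusive"
    if normal:
        return "non-diabetic"
    return "inconclusive"  # no glycemia data and no registered problem


if __name__ == "__main__":
    print(classify_patient(False, [98.0, 104.0]))
    print(classify_patient(False, [131.0]))
    print(classify_patient(True, []))
```

Checking the single-abnormal-value case before the non-diabetic case is one way to resolve the overlap between the two definitions; the opposite order would be equally consistent with the abstract.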
The standardized codes algorithm achieved a Kappa coefficient value of 0.59 (95% CI 0.49, 0.59) in the validation set. The Boolean logic algorithm reached 0.82 (95% CI 0.76, 0.88). A slightly higher value was achieved by the feedforward neural network (0.9, 95% CI 0.85, 0.94). The best-performing learner was the stacked generalization meta-learner, which reached a Kappa coefficient value of 0.95 (95% CI 0.91, 0.98).
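For readers unfamiliar with the metric, the sketch below shows how a Kappa coefficient compares an algorithm's labels with the manual review. It assumes the unweighted Cohen's kappa, which the abstract does not state explicitly. The function name `cohen_kappa` and the toy labels are illustrative only and are not data from the study.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(reference)
    # Observed agreement: fraction of patients given the same label by both.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n

    # Chance agreement: product of the marginal label frequencies per class.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels for six hypothetical patients (not study data).
    manual = ["diabetic", "diabetic", "non-diabetic",
              "non-diabetic", "inconclusive", "inconclusive"]
    algorithm = ["diabetic", "diabetic", "non-diabetic",
                 "inconclusive", "inconclusive", "inconclusive"]
    print(round(cohen_kappa(manual, algorithm), 2))  # prints 0.75
```

In the toy example, observed agreement is 5/6 and chance agreement is 1/3, so kappa is (5/6 - 1/3) / (1 - 1/3) = 0.75.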
The stacked generalization strategy and the feedforward neural network showed the best classification metrics in the validation set. Implementing these algorithms enables accurate use of data from thousands of patients.