Thangaraj Phyllis M, Kummer Benjamin R, Lorberbaum Tal, Elkind Mitchell S V, Tatonetti Nicholas P
Department of Biomedical Informatics, Columbia University, 622 W 168th St., PH-20, New York, NY, 10032, USA.
Department of Systems Biology, Columbia University, New York, NY, USA.
BioData Min. 2020 Dec 7;13(1):21. doi: 10.1186/s13040-020-00230-x.
Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification, one that avoids the current laborious and ungeneralizable generation of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype AIS patients using data from an EHR.
Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank.
Across all models, we found that the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the mean average precision score was 0.790 ± 0.196, with minimal feature processing. Classifiers trained on cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, we found that the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60-150 fold over expected).
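For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how AUROC, average precision, F1, and fold enrichment can be computed from predicted probabilities. It is not the authors' implementation: the function names, the 0.5 decision threshold, the `top_k` cutoff, and the toy labels and scores are all assumptions made for illustration, and the average precision helper assumes no tied scores.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case scores above a random control (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at each rank where a true case appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def f1_score(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """F1 at a fixed probability threshold (the 0.5 default is an assumption)."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn)


def fold_enrichment(labels: Sequence[int], scores: Sequence[float], top_k: int) -> float:
    """Case rate among the top_k highest-scored patients divided by the overall case rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    top_rate = sum(y for _, y in ranked[:top_k]) / top_k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Toy data, invented for illustration only: 1 = AIS case, 0 = control.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print("AUROC:", round(auroc(labels, scores), 3))
print("Average precision:", round(average_precision(labels, scores), 3))
print("F1 at 0.5:", round(f1_score(labels, scores), 3))
print("Fold enrichment (top 2):", round(fold_enrichment(labels, scores, top_k=2), 3))
```

Fold enrichment is the quantity behind a statement such as "60-150 fold over expected": it compares how often true cases appear among the highest-scored patients with how often they would appear by chance at the baseline prevalence.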
Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.