Pan Jie, Zhang Zilong, Peters Steven Ray, Vatanpour Shabnam, Walker Robin L, Lee Seungwon, Martin Elliot A, Quan Hude
Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.
Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.
Brain Inform. 2023 Sep 2;10(1):22. doi: 10.1186/s40708-023-00203-w.
Abstracting cerebrovascular disease (CeVD) from inpatient electronic medical records (EMRs) through natural language processing (NLP) is pivotal for automated disease surveillance and improving patient outcomes. Existing methods rely on manual abstraction by coders, which is subject to time delays and under-coding. This study sought to develop an NLP-based method to detect CeVD using EMR clinical notes.
CeVD status was confirmed through chart review of randomly selected hospitalized patients who were 18 years or older and discharged from 3 hospitals in Calgary, Alberta, Canada, between January 1 and June 30, 2015. These patients' chart data were linked to the administrative Discharge Abstract Database (DAD) and Sunrise Clinical Manager (SCM) EMR database records by Personal Health Number (a unique lifetime identifier) and admission date. We trained multiple NLP predictive models by combining two clinical concept extraction methods with two supervised machine learning (ML) methods: random forest and XGBoost. Using chart review as the reference standard, we compared the performance of the models with that of the commonly applied International Classification of Diseases (ICD-10-CA) codes in terms of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
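The four validity metrics named above can be computed from a 2x2 comparison against the reference standard. The following is a minimal sketch, not the study's code: the function name `diagnostic_metrics` and the toy labels are hypothetical, with 1 meaning CeVD present and 0 meaning CeVD absent.

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, and NPV for binary labels.

    y_true: reference-standard labels (e.g. chart review), 1 = disease present.
    y_pred: labels from the method under evaluation (e.g. ICD codes or a model).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # detected cases among true cases
        "specificity": ratio(tn, tn + fp),  # correct negatives among non-cases
        "ppv": ratio(tp, tp + fp),          # true cases among flagged records
        "npv": ratio(tn, tn + fn),          # non-cases among unflagged records
    }


# Hypothetical toy data, not taken from the study.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(reference, predicted)
print(metrics)
```

Applying the same function to ICD-based labels and to model predictions, each against the same chart-review labels, gives the paired comparison described above.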
Of the study sample (n = 3036), the prevalence of CeVD was 11.8% (n = 360); the median patient age was 63 years; and females accounted for 50.3% (n = 1528) based on chart data. Among 49 extracted clinical documents from the EMR, four document types were identified as the most influential text sources for identifying CeVD: "nursing transfer report," "discharge summary," "nursing notes," and "inpatient consultation." The best-performing NLP model was XGBoost combined with Unified Medical Language System concepts extracted by cTAKES (top-ranked concepts included "Cerebrovascular accident" and "Transient ischemic attack") and the term frequency-inverse document frequency vectorizer. Compared with ICD codes, the model achieved higher validity overall (ICD codes vs. NLP model): sensitivity 25.0% vs. 70.0%, PPV 82.6% vs. 87.8%, and NPV 90.8% vs. 97.1%, with similar specificity (99.3% vs. 99.1%).
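To illustrate how extracted concepts can become model features, the sketch below applies a textbook term frequency-inverse document frequency weighting, tf x log(N/df), to lists of concept tokens. It is an assumption-laden illustration rather than the study's pipeline: the concept tokens and the function name `tfidf_vectors` are hypothetical, the exact weighting variant used by the study's vectorizer is not stated in the source, and the resulting vectors would still need to be passed to a classifier such as XGBoost.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting.

    docs: list of documents, each a list of concept tokens.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each concept.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the concept count divided by document length.
        vectors.append(
            [(counts[term] / total) * idf[term] if total else 0.0 for term in vocab]
        )
    return vocab, vectors


# Hypothetical concept tokens, one list per patient record.
records = [
    ["cerebrovascular_accident", "hypertension"],
    ["hypertension"],
    ["transient_ischemic_attack", "hypertension"],
]
vocab, vectors = tfidf_vectors(records)
for row in vectors:
    print([round(value, 3) for value in row])
```

In this toy example the concept shared by every record receives zero weight, while concepts that appear in only one record are weighted up, which is the behavior that lets rarer, more discriminative concepts stand out as features.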
The NLP algorithm developed in this study performed better than the ICD code algorithm in detecting CeVD. The NLP models could lead to an automated EMR tool for identifying CeVD cases and could be applied in future work such as surveillance and longitudinal studies.