Fernandes Marta, Sun Haoqi, Jain Aayushee, Alabsi Haitham S, Brenner Laura N, Ye Elissa, Ge Wendong, Collens Sarah I, Leone Michael J, Das Sudeshna, Robbins Gregory K, Mukerji Shibani S, Westover M Brandon
Department of Neurology, Massachusetts General Hospital, Boston, MA, United States.
Clinical Data Animation Center, Boston, MA, United States.
JMIR Med Inform. 2021 Feb 10;9(2):e25457. doi: 10.2196/25457.
Medical notes are a rich source of patient data; however, the nature of unstructured text has largely precluded the use of these data for large retrospective analyses. Transforming clinical text into structured data can enable large-scale research studies with electronic health records (EHR) data. Natural language processing (NLP) can be used for text information retrieval, reducing the need for labor-intensive chart review. Here we present an application of NLP to large-scale analysis of medical records at 2 large hospitals for patients hospitalized with COVID-19.
Our study goal was to develop an NLP pipeline to classify the discharge disposition (home, inpatient rehabilitation, skilled nursing inpatient facility [SNIF], and death) of patients hospitalized with COVID-19 based on hospital discharge summary notes.
Text mining and feature engineering were applied to unstructured text from hospital discharge summaries. The study included patients with COVID-19 discharged from 2 hospitals in the Boston, Massachusetts area (Massachusetts General Hospital and Brigham and Women's Hospital) between March 10, 2020, and June 30, 2020. The data were divided into a training set (70%) and hold-out test set (30%). Discharge summaries were represented as bags-of-words consisting of single words (unigrams), bigrams, and trigrams. The number of features was reduced during training by excluding n-grams that occurred in fewer than 10% of discharge summaries, and further reduced using least absolute shrinkage and selection operator (LASSO) regularization while training a multiclass logistic regression model. Model performance was evaluated using the hold-out test set.
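The feature-construction step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`ngrams`, `build_vocabulary`, `vectorize`), the whitespace tokenizer, and the toy notes are assumptions made for illustration. It builds unigram, bigram, and trigram counts and drops n-grams that appear in fewer than a given fraction of documents. The LASSO-regularized multiclass logistic regression would then be fit on these vectors with a statistics library and is not shown here.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(documents, min_doc_fraction=0.10):
    """Keep n-grams that occur in at least min_doc_fraction of documents."""
    doc_freq = Counter()
    for text in documents:
        # A set ensures each n-gram is counted once per document.
        doc_freq.update(set(ngrams(text.lower().split())))
    threshold = min_doc_fraction * len(documents)
    return sorted(term for term, count in doc_freq.items() if count >= threshold)


def vectorize(text, vocabulary):
    """Bag-of-words count vector for one document over a fixed vocabulary."""
    counts = Counter(ngrams(text.lower().split()))
    return [counts[term] for term in vocabulary]


# Toy notes (hypothetical, not from the study data).
notes = [
    "patient discharged home with home health",
    "patient intubated for ards",
    "patient discharged to rehab",
]
# A 50% threshold is used here only because the toy corpus is tiny;
# the abstract reports a 10% threshold.
vocabulary = build_vocabulary(notes, min_doc_fraction=0.5)
features = [vectorize(note, vocabulary) for note in notes]
```

With real discharge summaries, the vocabulary would be built on the training set only and then reused unchanged to vectorize the hold-out test set, so that no information from the test notes leaks into feature selection.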
The study cohort included 1737 adult patients (median age 61 [SD 18] years; 55% men; 45% White and 16% Black; 14% nonsurvivors and 61% discharged home). The model selected 179 features from a vocabulary of 1056 engineered features, consisting of combinations of unigrams, bigrams, and trigrams. The features contributing most to the model's classification of each outcome were the following: "appointments specialty," "home health," and "home care" (home); "intubate" and "ARDS" (inpatient rehabilitation); "service" (SNIF); and "brief assessment" and "covid" (death). For prediction of discharge disposition in the hold-out test set, the model achieved a micro-average area under the receiver operating characteristic curve of 0.98 (95% CI 0.97-0.98) and an average precision of 0.81 (95% CI 0.75-0.84).
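For readers unfamiliar with the micro-averaged metric reported above, the following sketch shows one common way to compute it. It is not the authors' evaluation code, and the function names and toy probabilities are assumptions. Each (patient, class) pair is treated as a separate binary decision, the pairs are pooled, and a single area under the receiver operating characteristic curve is computed over the pool. The AUROC itself is computed as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Each positive/negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_average_auroc(true_classes, prob_rows, n_classes):
    """Pool one-vs-rest indicators across all classes, then compute AUROC."""
    flat_labels, flat_scores = [], []
    for y, probs in zip(true_classes, prob_rows):
        for k in range(n_classes):
            flat_labels.append(1 if y == k else 0)
            flat_scores.append(probs[k])
    return auroc(flat_labels, flat_scores)


# Toy example with 3 classes (hypothetical predicted probabilities).
true_classes = [0, 1, 2]
prob_rows = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
score = micro_average_auroc(true_classes, prob_rows, n_classes=3)
```

The pairwise loop is quadratic in the number of pooled pairs, which is fine for illustration; a rank-based implementation from a standard library would be preferable at the scale of the study cohort.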
A supervised learning-based NLP approach is able to classify the discharge disposition of patients hospitalized with COVID-19. This approach has the potential to accelerate and increase the scale of research on patients' discharge disposition that is possible with EHR data.