Gray Geoffrey M, Zirikly Ayah, Ahumada Luis M, Rouhizadeh Masoud, Richards Thomas, Kitchen Christopher, Foroughmand Iman, Hatef Elham
Center for Pediatric Data Science and Analytic Methodology, Johns Hopkins All Children's Hospital, St. Petersburg, FL, United States.
Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, United States.
JAMIA Open. 2023 Oct 4;6(4):ooad085. doi: 10.1093/jamiaopen/ooad085. eCollection 2023 Dec.
To develop and test a scalable, performant, and rule-based model for identifying 3 major domains of social needs (residential instability, food insecurity, and transportation issues) from the unstructured data in electronic health records (EHRs).
We included patients aged 18 years or older who received care at the Johns Hopkins Health System (JHHS) between July 2016 and June 2021 and had at least 1 unstructured (free-text) note in their EHR during the study period. We combined manual lexicon curation with semiautomated lexicon creation for feature development. We developed an initial rule-based pipeline (Match Pipeline) using 2 keyword sets for each social needs domain. We performed rule-based keyword matching for distinct lexicons and tested the algorithm on an annotated dataset comprising 192 patients. Starting with a set of expert-identified keywords, we tested adjustments to the keywords by evaluating the false positives and false negatives identified in the labeled dataset. We assessed the performance of the algorithm using precision, recall, and F1 score.
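The rule-based keyword matching described above can be illustrated with a minimal Python sketch. The lexicon terms, domain keys, and function names below are hypothetical placeholders chosen for illustration; they are not the study's curated keyword sets or its Match Pipeline, and the sketch assumes simple case-insensitive whole-phrase matching.

```python
import re
from typing import Dict, List, Pattern, Set

# Hypothetical lexicons for illustration only; the study's curated
# keyword sets are not reproduced here.
LEXICONS: Dict[str, List[str]] = {
    "residential_instability": ["homeless", "eviction", "living in a shelter"],
    "food_insecurity": ["food insecurity", "food pantry", "skipping meals"],
    "transportation_issues": ["no transportation", "missed the bus", "lacks a ride"],
}


def compile_lexicon(terms: List[str]) -> Pattern[str]:
    """Build one case-insensitive regex that matches any term as a whole phrase."""
    # Longer terms first so multi-word phrases win over shorter overlapping ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERNS: Dict[str, Pattern[str]] = {
    domain: compile_lexicon(terms) for domain, terms in LEXICONS.items()
}


def match_domains(note: str) -> Set[str]:
    """Return the social needs domains whose lexicon matches a single note."""
    return {domain for domain, pattern in PATTERNS.items() if pattern.search(note)}


def flag_patient(notes: List[str]) -> Set[str]:
    """Flag a patient for a domain if any of their notes matches it."""
    flagged: Set[str] = set()
    for note in notes:
        flagged |= match_domains(note)
    return flagged
```

In this sketch a patient is flagged for a domain as soon as one note matches. Plain keyword matching of this kind does not handle negation or context, which is one reason reviewing false positives and false negatives against labeled data, as described above, matters.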
The algorithm for identifying residential instability had the best overall performance, with weighted-average precision, recall, and F1 score of 0.92, 0.84, and 0.92 for identifying patients with homelessness and 0.84, 0.82, and 0.79 for identifying patients with housing insecurity. Metrics for the food insecurity algorithm were high, but the transportation issues algorithm had the lowest overall performance.
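As a rough illustration of how weighted-average precision, recall, and F1 score can be computed, the following Python sketch weights each class's metrics by its support (the number of true instances of that class). Support weighting is an assumption made here for illustration; the abstract does not state the weighting scheme, and the function names and toy labels are hypothetical.

```python
from typing import List, Tuple


def class_metrics(y_true: List[int], y_pred: List[int], label: int) -> Tuple[float, float, float, int]:
    """Precision, recall, F1, and support for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, tp + fn


def weighted_average(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Support-weighted average of per-class precision, recall, and F1."""
    total = len(y_true)
    avg_p = avg_r = avg_f = 0.0
    for label in sorted(set(y_true)):
        p, r, f, support = class_metrics(y_true, y_pred, label)
        weight = support / total
        avg_p += weight * p
        avg_r += weight * r
        avg_f += weight * f
    return avg_p, avg_r, avg_f


# Toy labels (not study data): 1 = social need present, 0 = absent.
toy_true = [1, 1, 1, 0]
toy_pred = [1, 1, 0, 0]
toy_scores = weighted_average(toy_true, toy_pred)
```

Because each metric is averaged across classes separately, a weighted-average F1 score need not lie between the weighted-average precision and recall.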
The NLP algorithm for identifying social needs at JHHS performed relatively well and offers an opportunity for implementation in a healthcare system.
The NLP approach developed in this project could be adapted and potentially operationalized in the routine data processes of a healthcare system.