Bunnell H Timothy, Reedy Cara, Lorman Vitaly, Jhaveri Ravi, Rivera-Sepulveda Andrea, Salamon Katherine S, Patel Payal B, Morse Keith E, Davenport Mattina A, Cowell Lindsay G, Utidjian Levon, Christakis Dimitri A, Rao Suchitra, Sills Marion R, Case Abigail, Mendonca Eneida A, Taylor Bradley W, Rutter Jacqueline, Martinez Aaron Thomas, Letts Rebecca, Bailey L Charles, Forrest Christopher B
Biomedical Research Informatics Center, Nemours Children's Health, Wilmington, DE 19803, United States.
Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, PA 19104, United States.
JAMIA Open. 2025 Sep 4;8(5):ooaf089. doi: 10.1093/jamiaopen/ooaf089. eCollection 2025 Oct.
To develop a natural language processing (NLP) pipeline for unstructured electronic health record (EHR) data to identify symptoms and functional impacts associated with Long COVID in children.
We analyzed 48 287 outpatient progress notes from 10 618 pediatric patients from 12 institutions. We evaluated notes obtained 28 to 179 days after a COVID-19 diagnosis or positive test. Two samples were examined: patients with evidence of Long COVID and patients with acute COVID but no evidence of Long COVID based on diagnostic codes. The pipeline identified clinical concepts associated with 21 symptoms and 4 functional impact categories. Subject matter experts (SMEs) screened a sample of 4586 terms from the NLP output to assess pipeline accuracy. Prevalence and concordance of each of the 25 concepts were compared between the 2 patient samples.
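The note-selection window described above (28 to 179 days after a COVID-19 diagnosis or positive test) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `in_followup_window`, the tuple layout of the toy notes, and the choice to treat both bounds as inclusive are assumptions made only for illustration.

```python
from datetime import date, timedelta


def in_followup_window(index_date, note_date, min_days=28, max_days=179):
    """Return True if the note falls 28 to 179 days after the index date.

    Assumption: both bounds are inclusive; the abstract does not say.
    """
    days_since_index = (note_date - index_date).days
    return min_days <= days_since_index <= max_days


# Hypothetical toy data: (note_id, note_date) pairs for one patient.
index_date = date(2021, 1, 1)  # date of diagnosis or positive test
notes = [
    ("note-a", index_date + timedelta(days=10)),   # too early
    ("note-b", index_date + timedelta(days=28)),   # first eligible day
    ("note-c", index_date + timedelta(days=179)),  # last eligible day
    ("note-d", index_date + timedelta(days=180)),  # too late
]

eligible = [note_id for note_id, d in notes if in_followup_window(index_date, d)]
```

With the toy data above, only the notes dated 28 and 179 days after the index date are retained.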
A binary assertion measure comparing SME and NLP assertions showed moderate accuracy (N = 4133; F1 = .80) and improved substantially when only high-confidence SME assertions were considered (N = 2043; F1 = .90). Overall, the 25 Long COVID concept categories were markedly more prevalent in the presumptive Long COVID cohort, and differences were noted between concepts identified in notes versus structured data.
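For readers unfamiliar with the F1 statistic reported above, the following sketch shows how it is computed from paired binary assertions. It assumes the SME assertion is treated as the reference label and the NLP assertion as the prediction, which the abstract does not state explicitly; the function name `f1_from_assertions` and the toy labels are hypothetical and do not reproduce the study's data.

```python
def f1_from_assertions(reference, predicted):
    """Compute F1 for paired binary assertions (1 = asserted, 0 = not asserted).

    Assumption: `reference` holds the SME labels and `predicted` the NLP labels.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r and p)
    false_pos = sum(1 for r, p in pairs if not r and p)
    false_neg = sum(1 for r, p in pairs if r and not p)

    if true_pos == 0:
        # No correct positive assertions: F1 is defined here as 0.
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for five terms.
sme_labels = [1, 1, 1, 0, 0]
nlp_labels = [1, 1, 0, 1, 0]
toy_f1 = f1_from_assertions(sme_labels, nlp_labels)
```

In the toy example there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3.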
This preliminary analysis illustrates the additional insight into a syndrome such as Long COVID that can be gained by incorporating notes data to characterize symptoms and functional impacts.
These data support the importance of incorporating NLP methodology, when possible, both in designing computable phenotypes and in accurately characterizing patients with Long COVID.