Natural Language Processing Improves Reliable Identification of COVID-19 Compared to Diagnostic Codes Alone.

Authors

Hendrix Nathaniel, Parikh Rishi V, Taskier Madeline, Walter Grace, Phillips Robert L, Rehkopf David H

Affiliations

Center for Professionalism and Value in Health Care, American Board of Family Medicine.

Department of Epidemiology and Population Health, Stanford School of Medicine.

Publication information

Am J Epidemiol. 2025 Jul 30. doi: 10.1093/aje/kwaf162.

PMID:40731247

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12335755/

Abstract

Observational COVID-19 studies often rely on diagnostic codes, but the accuracy of these codes and their potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes with that of classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5,000 patients whose COVID-19 status was assessed by physicians based on notes; and second, 21,659 patients (out of 1,560,564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Code sensitivity was highest among younger patients (68.6% for those under 18 years versus 60.6% for those 75 and older) and among Hispanic patients (68.0% versus 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when predicting data collected after training. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation.
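
The two metrics reported in the abstract, subgroup sensitivity of the diagnostic code and area under the ROC curve for a classifier, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the record layout, and the toy values below are all hypothetical and chosen only to show how such quantities are typically computed against a reference standard.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return {group: sensitivity}, where sensitivity = TP / (TP + FN).

    Each record is a tuple (group, reference_positive, coded):
      group               subgroup label, e.g. an age band (hypothetical layout)
      reference_positive  True if the reference standard says COVID-19
      coded               True if a COVID-19 diagnostic code is documented
    Only reference-positive patients contribute to sensitivity.
    """
    true_pos = defaultdict(int)
    ref_pos = defaultdict(int)
    for group, reference_positive, coded in records:
        if reference_positive:
            ref_pos[group] += 1
            if coded:
                true_pos[group] += 1
    return {g: true_pos[g] / ref_pos[g] for g in ref_pos}


def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the study).
toy_records = [
    ("<18", True, True),
    ("<18", True, True),
    ("<18", True, False),
    ("<18", False, False),  # reference-negative: ignored for sensitivity
    ("75+", True, True),
    ("75+", True, False),
    ("75+", True, False),
    ("75+", True, False),
]
result = sensitivity_by_group(toy_records)
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

A gap between subgroup sensitivities, as in the toy data, is what differential misclassification looks like in practice: the code misses true cases more often in one subgroup than in another.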



