Natural Language Processing Improves Reliable Identification of COVID-19 Compared to Diagnostic Codes Alone.

Author information

Hendrix Nathaniel, Parikh Rishi V, Taskier Madeline, Walter Grace, Phillips Robert L, Rehkopf David H

Affiliations

Center for Professionalism and Value in Health Care, American Board of Family Medicine.

Department of Epidemiology and Population Health, Stanford School of Medicine.

Publication information

Am J Epidemiol. 2025 Jul 30. doi: 10.1093/aje/kwaf162.

Abstract

Observational COVID-19 studies often rely on diagnostic codes, but the accuracy of these codes and their potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes with that of classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5,000 patients whose COVID-19 status was assessed by physicians based on notes; and second, 21,659 patients (out of 1,560,564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity of the diagnostic codes was highest among younger patients (68.6% for those <18 years versus 60.6% for those 75+) and among Hispanic patients (68.0% versus 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when predicting on data collected after training. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation.
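
The subgroup comparison described in the abstract amounts to computing the sensitivity of a diagnostic code flag separately within each patient group, restricted to patients considered positive by the reference standard. The following is a minimal Python sketch of that calculation, not the authors' code. The function name `sensitivity_by_group`, the record layout `(group, reference_positive, has_code)`, and the toy records are all assumptions made for illustration; none of the values come from the study.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return {group: sensitivity} of a binary code flag.

    Each record is a tuple (group, reference_positive, has_code).
    The field names are hypothetical and exist only for this sketch.
    """
    true_pos = defaultdict(int)   # reference-positive patients who have a code
    positives = defaultdict(int)  # all reference-positive patients
    for group, reference_positive, has_code in records:
        if not reference_positive:
            continue  # sensitivity only considers reference-positive patients
        positives[group] += 1
        if has_code:
            true_pos[group] += 1
    return {g: true_pos[g] / positives[g] for g in positives}


# Toy, made-up records (not data from the study).
toy_records = [
    ("<18", True, True),
    ("<18", True, False),
    ("75+", True, True),
    ("75+", True, False),
    ("75+", True, False),
    ("75+", True, False),
    ("75+", False, True),  # reference-negative, so ignored for sensitivity
]

result = sensitivity_by_group(toy_records)
```

A gap in sensitivity between groups, such as the one between the two toy groups here, is what the abstract refers to as differential misclassification: the code misses true cases at different rates depending on the subgroup.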
