Hendrix Nathaniel, Parikh Rishi V, Taskier Madeline, Walter Grace, Phillips Robert L, Rehkopf David H
Center for Professionalism and Value in Health Care, American Board of Family Medicine.
Department of Epidemiology and Population Health, Stanford School of Medicine.
Am J Epidemiol. 2025 Jul 30. doi: 10.1093/aje/kwaf162.
Observational COVID-19 studies often rely on diagnostic codes, but the accuracy of these codes and their potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes with that of classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5000 patients whose COVID-19 status was assessed by physicians from clinical notes; and second, 21,659 patients (out of 1,560,564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity of diagnostic codes was highest among younger patients (68.6% for those <18 years versus 60.6% for those 75+) and among Hispanic patients (68.0% versus 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when the classifiers were applied to data collected after training. While NLP may improve cohort identification, frequent retraining is likely needed to keep pace with changing documentation practices.
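The subgroup comparison reported in the abstract rests on a simple quantity: among patients judged to have COVID-19 by a reference standard, the fraction who also carry a diagnostic code. The sketch below illustrates that calculation only. It is not the authors' code; the function name `sensitivity_by_group`, the record layout, and the toy records are all assumptions made for illustration, and the toy values are invented rather than taken from the study.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return diagnostic-code sensitivity per subgroup.

    `records` is an iterable of (group, reference_positive, code_present)
    tuples. This layout is hypothetical, chosen for illustration.
    """
    reference_positives = defaultdict(int)
    coded_positives = defaultdict(int)
    for group, reference_positive, code_present in records:
        if not reference_positive:
            # Sensitivity only considers reference-positive patients.
            continue
        reference_positives[group] += 1
        if code_present:
            coded_positives[group] += 1
    return {
        group: coded_positives[group] / total
        for group, total in reference_positives.items()
    }


# Invented toy records, NOT study data.
toy_records = [
    ("<18", True, True),
    ("<18", True, True),
    ("<18", True, False),
    ("<18", False, False),  # reference-negative, ignored
    ("75+", True, True),
    ("75+", True, True),
    ("75+", True, False),
    ("75+", True, False),
]

result = sensitivity_by_group(toy_records)
for group, value in sorted(result.items()):
    print(f"{group}: {value:.3f}")
```

A gap between subgroups in this quantity is what the abstract calls differential misclassification: a cohort defined by codes alone would under-represent the groups with lower sensitivity.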