Wang Lijing, Zipursky Amy, Geva Alon, McMurry Andrew J, Mandl Kenneth D, Miller Timothy A
medRxiv. 2023 Jan 19:2023.01.19.23284738. doi: 10.1101/2023.01.19.23284738.
To identify a cohort of COVID-19 cases, including when evidence of virus positivity was only mentioned in the clinical text, not in structured laboratory data in the electronic health record (EHR).
Statistical classifiers were trained on feature representations derived from unstructured text in patient EHRs. For training, we used a proxy dataset of patients with COVID-19 polymerase chain reaction (PCR) tests. We selected a model based on its performance on the proxy dataset and applied it to instances without COVID-19 PCR tests. A physician reviewed a sample of these instances to validate the classifier.
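The proxy-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Note` type, the `split_by_pcr` helper, and the example note texts are all hypothetical, and the abstract does not specify the features or the classifier.

```python
from typing import List, NamedTuple, Optional, Tuple


class Note(NamedTuple):
    """Hypothetical record: one clinical note plus its structured PCR result."""
    text: str                   # unstructured clinical text
    pcr_result: Optional[bool]  # True/False if a PCR result exists, else None


def split_by_pcr(notes: List[Note]) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Split notes into proxy-labeled training pairs and unlabeled texts.

    Notes with a structured PCR result get that result as a proxy label.
    Notes without one are the instances a trained classifier is applied to.
    """
    labeled = [(n.text, n.pcr_result) for n in notes if n.pcr_result is not None]
    unlabeled = [n.text for n in notes if n.pcr_result is None]
    return labeled, unlabeled


# Made-up example notes, for illustration only.
notes = [
    Note("Pt tested positive for COVID at urgent care last week.", None),
    Note("COVID PCR sent, result pending.", True),
    Note("No fever or cough. Routine pre-op screening.", False),
]
labeled, unlabeled = split_by_pcr(notes)
```

The second made-up note shows one way a proxy label can be noisy: the text was written before the result was known, so it carries little evidence of the eventual label.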
On the test split of the proxy dataset, our best classifier obtained an F1 score of 0.56, a precision of 0.6, and a recall of 0.52 for SARS-CoV-2 positive cases. In an expert validation, the classifier correctly identified 90.8% (79/87) as COVID-19 positive and 97.8% (91/93) as not SARS-CoV-2 positive. The classifier identified an additional 960 positive cases that did not have SARS-CoV-2 lab tests in hospital, and only 177 of those cases had the ICD-10 code for COVID-19.
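The reported figures can be checked with a few lines of arithmetic. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function and variable names are illustrative, and the share of ICD-10-coded cases is derived here from the reported counts rather than stated in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Proxy-dataset test split: reported precision and recall.
f1 = f1_score(0.6, 0.52)

# Expert validation: reported counts of correctly identified instances.
positive_correct = 79 / 87
negative_correct = 91 / 93

# Share of the 960 newly identified positives that carried the ICD-10 code.
icd_share = 177 / 960

print(round(f1, 2))                      # 0.56
print(round(100 * positive_correct, 1))  # 90.8
print(round(100 * negative_correct, 1))  # 97.8
print(round(100 * icd_share, 1))         # 18.4
```

By this calculation, roughly four in five of the additional text-identified positives would not be found by searching for the ICD-10 code alone.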
Performance on the proxy dataset may be lower than in the expert validation because proxy instances sometimes include discussion of pending lab tests. The most predictive features are meaningful and interpretable. The type of external test that was performed is rarely mentioned.
COVID-19 cases whose testing was done outside the hospital can be reliably detected from the text in EHRs. Training on a proxy dataset was a suitable method for developing a high-performing classifier without labor-intensive labeling efforts.