Wang Lijing, Zipursky Amy R, Geva Alon, McMurry Andrew J, Mandl Kenneth D, Miller Timothy A
Department of Data Science, New Jersey Institute of Technology, Newark, New Jersey, USA.
Computational Health Informatics Program and Department of Emergency Medicine, Boston Children's Hospital, Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA.
JAMIA Open. 2023 Jul 5;6(3):ooad047. doi: 10.1093/jamiaopen/ooad047. eCollection 2023 Oct.
To identify a cohort of COVID-19 cases, including when evidence of virus positivity was only mentioned in the clinical text, not in structured laboratory data in the electronic health record (EHR).
Statistical classifiers were trained on feature representations derived from unstructured text in patient EHRs. We used a proxy dataset of patients with COVID-19 polymerase chain reaction (PCR) tests for training. We selected a model based on its performance on the proxy dataset and applied it to instances without COVID-19 PCR tests. A physician reviewed a sample of these instances to validate the classifier.
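The proxy-label setup described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the classifier family or the feature representation, so a bag-of-words multinomial naive Bayes model stands in for them, and the notes, labels, and function names are hypothetical toy examples rather than study data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a note and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(notes, labels):
    """Fit a multinomial naive Bayes model on bag-of-words counts."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(notes, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return 1 if the positive class has the higher log posterior, else 0."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


# Hypothetical proxy data: each note is paired with an in-hospital PCR
# result (1 = positive, 0 = negative), so no manual labeling is needed.
proxy_notes = [
    "patient tested positive for covid at urgent care two days ago",
    "mother reports positive home covid test yesterday",
    "covid swab negative, fever likely viral uri",
    "no known covid exposure, test negative at pediatrician",
]
proxy_labels = [1, 1, 0, 0]

model = train_naive_bayes(proxy_notes, proxy_labels)

# Apply the trained model to a note from an encounter with no PCR test.
unlabeled_note = "seen at outside clinic, covid test positive"
prediction = predict(model, unlabeled_note)
```

The point of the design is that the PCR result supplies the training label, while the trained model is later applied only to text from encounters that lack such a result.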
On the test split of the proxy dataset, our best classifier obtained 0.56 F1, 0.6 precision, and 0.52 recall scores for SARS-CoV2 positive cases. In an expert validation, the classifier correctly identified 97.6% (81/84) as COVID-19 positive and 97.8% (91/93) as not SARS-CoV2 positive. Among cases without SARS-CoV2 lab tests in hospital, the classifier labeled an additional 960 as positive, and only 177 of those cases had the ICD-10 code for COVID-19.
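As a quick consistency check on the proxy-dataset scores, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name is an illustrative choice, not something taken from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported proxy-dataset precision and recall for SARS-CoV2 positive cases.
f1 = f1_score(0.6, 0.52)
print(round(f1, 2))  # 0.56
```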
Performance on the proxy dataset may be lower than in the expert validation because proxy instances sometimes include discussion of pending lab tests. The most predictive features are meaningful and interpretable. The type of external test that was performed is rarely mentioned.
COVID-19 cases that had testing done outside of the hospital can be reliably detected from the text in EHRs. Training on a proxy dataset was a suitable method for developing a highly performant classifier without labor-intensive labeling efforts.