Biomedical & Health Informatics, University of Washington, Box 358047, Seattle, WA 98109, USA.
Department of Electrical & Computer Engineering, University of Washington, Campus Box 352500 185, Seattle, WA 98195-2500, USA.
J Biomed Inform. 2021 May;117:103761. doi: 10.1016/j.jbi.2021.103761. Epub 2021 Mar 26.
Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). Our span-based event extraction model outperforms an extractor built on MetaMapLite for the identification of symptoms with assertion values. In a secondary use application, we predicted COVID-19 test results using structured patient data (e.g. vital signs and laboratory results) and automatically extracted symptom information, to explore the clinical presentation of COVID-19. Automatically extracted symptoms improve COVID-19 prediction performance, beyond structured data alone.
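The F1 ranges quoted above summarize how well predicted events and assertion values match the gold annotations. As a rough illustration only, the sketch below computes precision, recall, and F1 over labeled spans under exact-match scoring. The exact-match criterion, the `Span` fields, and the function name `span_f1` are assumptions made for this example; the abstract does not state the matching criteria used in the paper.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A labeled text span. All field names are hypothetical."""
    note_id: str   # identifier of the clinical note
    start: int     # character offset where the span begins
    end: int       # character offset where the span ends
    label: str     # e.g. an event type or an assertion value


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring.

    A predicted span counts as correct only if its note, offsets, and
    label all equal those of a gold span (an assumption of this sketch).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {
        Span("note1", 0, 5, "Symptom"),
        Span("note1", 10, 15, "Symptom"),
        Span("note1", 20, 25, "COVID"),
    }
    predicted = {
        Span("note1", 0, 5, "Symptom"),
        Span("note1", 10, 16, "Symptom"),   # boundary error, not a match
        Span("note1", 20, 25, "COVID"),
        Span("note1", 30, 34, "Symptom"),   # spurious prediction
    }
    p, r, f = span_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is the strictest common choice; relaxed (overlap-based) matching would credit the boundary error in the usage example and yield higher scores.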