Lybarger Kevin, Ostendorf Mari, Yetisgen Meliha
Biomedical & Health Informatics, University of Washington, Box 358047 Seattle, WA 98109, USA.
Department of Electrical & Computer Engineering, University of Washington, Campus Box 352500 185, Seattle, WA 98195-2500, USA.
J Biomed Inform. 2021 Jan;113:103631. doi: 10.1016/j.jbi.2020.103631. Epub 2020 Dec 5.
Social determinants of health (SDOH) affect health outcomes, and knowledge of SDOH can inform clinical decision-making. Automatically extracting SDOH information from clinical text requires data-driven information extraction models trained on annotated corpora that are heterogeneous and frequently include critical SDOH. This work presents a new corpus with SDOH annotations, a novel active learning framework, and the first extraction results on the new corpus. The Social History Annotation Corpus (SHAC) includes 4480 social history sections with detailed annotation for 12 SDOH characterizing the status, extent, and temporal information of 18K distinct events. We introduce a novel active learning framework that selects samples for annotation using a surrogate text classification task as a proxy for a more complex event extraction task. The active learning framework successfully increases the frequency of health risk factors and improves automatic extraction of these events over undirected annotation. An event extraction model trained on SHAC achieves high extraction performance for substance use status (0.82-0.93 F1), employment status (0.81-0.86 F1), and living status type (0.81-0.93 F1) on data from three institutions.
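The sample-selection idea in the abstract, using a simpler surrogate text classification task to decide which notes to annotate for the harder event extraction task, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the cue-word list, `surrogate_score`, and `select_for_annotation` are hypothetical, and a toy keyword scorer stands in for a trained classifier. The abstract does not specify the actual selection criterion, so this is not the authors' method.

```python
from typing import Callable, List

# Hypothetical cue words standing in for a trained surrogate classifier.
RISK_CUES = ("smokes", "drinks", "homeless", "unemployed", "heroin")


def surrogate_score(note: str) -> float:
    """Toy proxy score: fraction of cue words found in the note (assumption)."""
    text = note.lower()
    hits = sum(cue in text for cue in RISK_CUES)
    return hits / len(RISK_CUES)


def select_for_annotation(
    pool: List[str], score_fn: Callable[[str], float], k: int
) -> List[str]:
    """Return the k unlabeled notes with the highest surrogate scores."""
    # sorted() is stable, so tied notes keep their original pool order.
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:k]


pool = [
    "Denies tobacco and alcohol.",
    "Smokes 1 ppd, drinks daily, currently homeless.",
    "Lives with wife, retired teacher.",
    "Unemployed, history of heroin use.",
]
selected = select_for_annotation(pool, surrogate_score, k=2)
```

Ranking by a cheap proxy score is one plausible way to raise the share of risk-factor mentions in the annotated set compared with undirected sampling; a real system would replace the keyword scorer with a learned model and likely balance several SDOH types.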