Nuance Communications, Burlington, Massachusetts, USA.
Microsoft, Redmond, Washington, USA.
J Am Med Inform Assoc. 2023 Jul 19;30(8):1448-1455. doi: 10.1093/jamia/ocad071.
Social determinants of health (SDOH) are nonmedical factors that can influence health outcomes. This paper seeks to extract SDOH from clinical texts in the context of the National NLP Clinical Challenges (n2c2) 2022 Track 2 Task.
Annotated and unannotated data from the Medical Information Mart for Intensive Care III (MIMIC-III) corpus, the Social History Annotation Corpus, and an in-house corpus were used to develop 2 deep learning models that used classification and sequence-to-sequence (seq2seq) approaches.
The seq2seq approach had the highest overall F1 scores in the challenge's 3 subtasks: 0.901 on the extraction subtask, 0.774 on the generalizability subtask, and 0.889 on the learning transfer subtask.
Both approaches rely on SDOH event representations that were designed to be compatible with transformer-based pretrained models, with the seq2seq representation supporting an arbitrary number of overlapping and sentence-spanning events. Models with adequate performance could be produced quickly, and the remaining mismatch between representation and task requirements was then addressed in postprocessing. The classification approach used rules to generate entity relationships from its sequence of token labels, while the seq2seq approach used constrained decoding and a constraint solver to recover entity text spans from its sequence of potentially ambiguous tokens.
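The span-recovery step can be illustrated with a small sketch. It is not the authors' implementation: the function names, the example sentence, and the "arguments should sit close to their trigger" constraint are all assumptions made for illustration, and an exhaustive search stands in for a real constraint solver. The sketch shows only how a generated phrase that occurs more than once in the note can be mapped back to one specific character span; constrained decoding itself is not shown.

```python
import re
from itertools import product


def find_occurrences(text, phrase):
    """Return every (start, end) character span where phrase occurs in text."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(phrase), text)]


def resolve_event(text, trigger, arguments):
    """Choose one occurrence per phrase for a single event.

    Assumed constraint (hypothetical): each argument should be as close as
    possible to the chosen trigger occurrence. Brute-force enumeration over
    all combinations stands in for a constraint solver.
    Returns a list of (phrase, (start, end)) pairs, or None if any phrase
    is absent from the text.
    """
    phrases = [trigger] + list(arguments)
    candidates = [find_occurrences(text, p) for p in phrases]
    if any(not spans for spans in candidates):
        return None

    def cost(assignment):
        # Total character distance from each argument start to the trigger start.
        trigger_start = assignment[0][0]
        return sum(abs(start - trigger_start) for start, _ in assignment[1:])

    best = min(product(*candidates), key=cost)
    return list(zip(phrases, best))


# Hypothetical note: "tobacco" occurs twice, so the trigger is ambiguous.
note = "Patient denies tobacco use. Father had heavy tobacco use for 20 years."
event = resolve_event(note, "tobacco", ["20 years"])
for phrase, (start, end) in event:
    print(phrase, start, end, repr(note[start:end]))
```

In this example the second mention of "tobacco" is selected because it lies nearer to the duration argument. A production system would likely need richer constraints (for example, ordering, sentence boundaries, or non-overlap between events) than this distance heuristic.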
We proposed 2 different approaches to extract SDOH from clinical texts with high accuracy. However, accuracy suffers on text from new healthcare institutions not present in the training data, and thus generalization remains an important topic for future study.