Information Systems and Cyber Security, The University of Texas at San Antonio, San Antonio, Texas, USA.
J Am Med Inform Assoc. 2023 Jul 19;30(8):1398-1407. doi: 10.1093/jamia/ocad041.
The impact of social determinants of health (SDoH) on patients' healthcare quality and on disparities in care is well known. Many SDoH items are not coded in structured form in electronic health records. They are often captured in free-text clinical notes, but few methods exist for extracting them automatically. We explore a multi-stage pipeline that combines named entity recognition (NER), relation classification (RC), and text classification to automatically extract SDoH information from clinical notes.
The study uses the N2C2 Shared Task data, which were collected from 2 sources of clinical notes: MIMIC-III and University of Washington Harborview Medical Centers. The dataset contains 4480 social history sections fully annotated for 12 SDoHs. To handle overlapping entities, we developed a novel marker-based NER model and used it in a multi-stage pipeline to extract SDoH information from clinical notes.
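The abstract does not describe the marker scheme itself, so the following is only a hypothetical illustration of the general idea behind marker-based span handling: each candidate span is wrapped in its own pair of marker tokens in a separate copy of the sentence, so two overlapping spans never have to share one flat tag sequence. The function names, the marker format, and the labels `Trigger` and `Event` are assumptions made for this sketch, not the authors' implementation.

```python
from typing import List, Tuple

# (start index, exclusive end index, label); all labels here are hypothetical
Span = Tuple[int, int, str]


def mark_span(tokens: List[str], span: Span) -> List[str]:
    """Return a copy of `tokens` with one span wrapped in marker tokens."""
    start, end, label = span
    if not (0 <= start < end <= len(tokens)):
        raise ValueError(f"span {span} is out of range for {len(tokens)} tokens")
    return (
        tokens[:start]
        + [f"<{label}>"]
        + tokens[start:end]
        + [f"</{label}>"]
        + tokens[end:]
    )


def marked_views(tokens: List[str], spans: List[Span]) -> List[List[str]]:
    """Build one marked copy of the sentence per span.

    Overlapping spans end up in different copies, so they cannot conflict.
    """
    return [mark_span(tokens, span) for span in spans]


tokens = "drinks two beers daily".split()
# Two overlapping spans: a single-token span nested inside a longer one.
spans = [(0, 1, "Trigger"), (0, 4, "Event")]
views = marked_views(tokens, spans)
for view in views:
    print(" ".join(view))
```

Each marked copy could then be passed to a classifier that decides whether the marked span is a valid entity of the given type; that downstream step is outside this sketch.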
Our marker-based system outperformed state-of-the-art span-based models at handling overlapping entities, as measured by overall micro-F1. It also achieved state-of-the-art performance compared with the shared task methods. Our approach achieved F1 scores of 0.9101, 0.8053, and 0.9025 for Subtasks A, B, and C, respectively.
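For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing a single score, so frequent classes weigh more than rare ones. The sketch below uses the standard definition; the class names and counts are made up for illustration and are not results from the study.

```python
from typing import Dict, Tuple

# Per-class counts as (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def micro_f1(counts: Counts) -> float:
    """Micro-averaged F1: sum the counts over classes, then compute F1 once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no predictions and no gold items
    return 2 * tp / denominator


# Hypothetical counts, chosen only to demonstrate the arithmetic.
example_counts = {
    "Alcohol": (8, 2, 2),
    "Employment": (1, 1, 3),
}
score = micro_f1(example_counts)
print(f"micro-F1 = {score:.4f}")
```

With these pooled counts (9 true positives, 3 false positives, 5 false negatives), the score is 18/26, about 0.6923.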
The major finding of this study is that the multi-stage pipeline effectively extracts SDoH information from clinical notes. This approach can improve the understanding and tracking of SDoHs in clinical settings. However, error propagation between pipeline stages may be an issue, and further research is needed to improve the extraction of entities with complex semantic meanings and of low-frequency entities. The source code is available at https://github.com/Zephyr1022/SDOH-N2C2-UTSA.