Datta Surabhi, Roberts Kirk
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
JAMIA Open. 2023 Apr 22;6(2):ooad027. doi: 10.1093/jamiaopen/ooad027. eCollection 2023 Jul.
Weak supervision holds significant promise to improve clinical natural language processing by leveraging domain resources and expertise instead of large manually annotated datasets alone. Here, our objective is to evaluate a weak supervision approach to extract spatial information from radiology reports.
Our weak supervision approach is based on data programming that uses rules (or labeling functions) relying on domain-specific dictionaries and radiology language characteristics to generate weak labels. The labels correspond to different spatial relations that are critical to understanding radiology reports. These weak labels are then used to fine-tune a pretrained Bidirectional Encoder Representations from Transformers (BERT) model.
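A minimal sketch of the data programming idea described above, in Python. The dictionaries, labeling functions, label names, and the majority-vote aggregation are all hypothetical illustrations; the abstract does not specify the actual rules, and data programming commonly combines labeling-function outputs with a learned label model rather than a simple vote.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical mini-dictionaries, not the resources used in the paper.
SPATIAL_TRIGGERS = {"in", "within", "at", "near", "along"}
ANATOMY_TERMS = {"lobe", "lung", "ventricle", "sinus"}
FINDING_TERMS = {"opacity", "lesion", "mass", "effusion"}


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def lf_trigger(tokens, i):
    """Vote TRIGGER if token i is a dictionary preposition and an
    anatomy term appears within the next four tokens."""
    if tokens[i] in SPATIAL_TRIGGERS and any(
        t in ANATOMY_TERMS for t in tokens[i + 1 : i + 5]
    ):
        return "TRIGGER"
    return ABSTAIN


def lf_anatomy(tokens, i):
    """Vote ANATOMY if token i is in the anatomy dictionary."""
    return "ANATOMY" if tokens[i] in ANATOMY_TERMS else ABSTAIN


def lf_finding(tokens, i):
    """Vote FINDING if token i is in the finding dictionary."""
    return "FINDING" if tokens[i] in FINDING_TERMS else ABSTAIN


LABELING_FUNCTIONS = [lf_trigger, lf_anatomy, lf_finding]


def weak_labels(sentence):
    """Apply every labeling function to every token and aggregate the
    non-abstaining votes by majority; tokens with no votes get 'O'."""
    tokens = tokenize(sentence)
    labeled = []
    for i, tok in enumerate(tokens):
        votes = [lf(tokens, i) for lf in LABELING_FUNCTIONS]
        votes = [v for v in votes if v is not ABSTAIN]
        label = Counter(votes).most_common(1)[0][0] if votes else "O"
        labeled.append((tok, label))
    return labeled


if __name__ == "__main__":
    for token, label in weak_labels("Small opacity in the right upper lobe."):
        print(f"{token}\t{label}")
```

Token-level labels produced this way could then serve as noisy training targets for fine-tuning a pretrained model. Under this sketch, extending coverage to a new reporting style would only require editing a dictionary or adding a function.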
Our weakly supervised BERT model provided satisfactory results in extracting spatial relations without manual annotations for training (spatial trigger F1: 72.89, relation F1: 52.47). When this model is further fine-tuned on manual annotations (relation F1: 68.76), performance surpasses the fully supervised state-of-the-art.
To our knowledge, this is the first work to automatically create detailed weak labels corresponding to radiological information of clinical significance. Our data programming approach is (1) adaptable as the labeling functions can be updated with relatively little manual effort to incorporate more variations in radiology language reporting formats and (2) generalizable as these functions can be applied across multiple radiology subdomains in most cases.
We demonstrate that a weakly supervised model performs sufficiently well in identifying a variety of relations from radiology text without manual annotations, while exceeding state-of-the-art results when annotated data are available.