Datta Surabhi, Roberts Kirk
School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.
Int J Med Inform. 2021 Nov 6;158:104628. doi: 10.1016/j.ijmedinf.2021.104628.
Radiology reports contain important clinical information that can be used to automatically construct fine-grained labels for applications requiring deep phenotyping. We propose a two-turn question answering (QA) method based on a transformer language model, BERT, for extracting detailed spatial information from radiology reports. We aim to demonstrate the advantage that a multi-turn QA framework provides over sequence-based methods for extracting fine-grained information.
Our proposed method identifies spatial and descriptor information by answering queries posed against the text of a radiology report. We frame the extraction problem so that the first turn identifies all the main radiology entities (e.g., finding, device, anatomy) and the spatial trigger terms, which denote the presence of a spatial relation between a finding or device and an anatomical location. The second turn extracts the other contextual information that fills important spatial roles with respect to a spatial trigger term, and also identifies the spatial and other descriptor terms that qualify a radiological entity. The queries are constructed using separate templates for the two turns, and we employ two query variations in the second turn.
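The two-turn query construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template wording, the `Query` class, the function names, and the element descriptions are all hypothetical, and the sketch assumes that the two second-turn variations differ in whether a general description of the frame element is appended to the query. The answers themselves would come from a BERT-based QA model, which is omitted here.

```python
from dataclasses import dataclass

# Hypothetical templates: the actual query wording is not given in the abstract.
FIRST_TURN_TEMPLATE = "Find all {element} terms in the report."
SECOND_TURN_TEMPLATE = "Find the {element} associated with '{anchor}'."

# Illustrative descriptions only (an assumption about the "domain knowledge" variation).
ELEMENT_DESCRIPTIONS = {
    "Figure": "The Figure is the finding or device being located.",
    "Ground": "The Ground is the anatomical location of the Figure.",
}


@dataclass(frozen=True)
class Query:
    turn: int      # 1 for entities and triggers, 2 for frame elements
    element: str   # what the query asks the model to extract
    text: str      # natural language query paired with the report text


def first_turn_queries(elements=("finding", "device", "anatomy", "spatial trigger")):
    """Build one query per main entity type and for spatial triggers."""
    return [Query(1, e, FIRST_TURN_TEMPLATE.format(element=e)) for e in elements]


def second_turn_queries(anchor, elements, use_description=False):
    """Build queries for the frame elements linked to a first-turn answer.

    `anchor` is a span extracted in turn one (a spatial trigger or an entity).
    With `use_description=True`, a general description of the element is
    appended when one is available.
    """
    queries = []
    for element in elements:
        text = SECOND_TURN_TEMPLATE.format(element=element, anchor=anchor)
        description = ELEMENT_DESCRIPTIONS.get(element)
        if use_description and description:
            text = f"{text} {description}"
        queries.append(Query(2, element, text))
    return queries
```

For example, if turn one returned the trigger "in", calling `second_turn_queries("in", ["Figure", "Ground"], use_description=True)` would produce one query per frame element, each paired with the same report text before being passed to the QA model.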
Compared with the best previously reported work on this task, which uses a traditional sequence tagging method, the two-turn QA model performs better on every component. This includes improvements of 12, 13, and 12 points in the average F1 scores for identifying the spatial triggers, Figure, and Ground frame elements, respectively.
Our experiments suggest that incorporating domain knowledge in the query (a general description of a frame element) helps obtain better results for some of the spatial and descriptive frame elements, especially with the clinical pre-trained BERT model. We further highlight that the two-turn QA approach is well suited to extracting information for complex schemas, where the objective is to identify all the frame elements linked to each spatial trigger and each finding/device/anatomy entity, thereby enabling the extraction of more comprehensive information in the radiology domain.
Extracting fine-grained spatial information from text by answering natural language queries has the potential to achieve better results than more standard sequence labeling-based approaches.