Datta Surabhi, Si Yuqi, Rodriguez Laritza, Shooshan Sonya E, Demner-Fushman Dina, Roberts Kirk
School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.
J Biomed Inform. 2020 Aug;108:103473. doi: 10.1016/j.jbi.2020.103473. Epub 2020 Jun 18.
Radiology reports contain a radiologist's interpretations of images, and these interpretations frequently describe spatial relations. Important radiographic findings are mostly described in reference to an anatomical location through spatial prepositions. Such spatial relationships are also linked to various differential diagnoses and are often described through uncertainty phrases. Structured representation of this clinically significant spatial information has the potential to be used in a variety of downstream clinical informatics applications. Our focus is to extract these spatial representations from the reports. For this, we first define a representation framework based on the Spatial Role Labeling (SpRL) scheme, which we refer to as Rad-SpRL. In Rad-SpRL, common radiological entities tied to spatial relations are encoded through four spatial roles: Trajector, Landmark, Diagnosis, and Hedge, all identified in relation to a spatial preposition (or Spatial Indicator). We annotated a total of 2,000 chest X-ray reports following Rad-SpRL. We then propose a deep learning-based natural language processing (NLP) method involving word- and character-level encodings to first extract the Spatial Indicators and then identify the corresponding spatial roles. Specifically, we use a bidirectional long short-term memory (Bi-LSTM) conditional random field (CRF) neural network as the baseline model. Additionally, we incorporate contextualized word representations from pre-trained language models (BERT and XLNet) for extracting the spatial information. We evaluate both gold and predicted Spatial Indicators for extracting the four types of spatial roles. The results are promising: the highest average F1 measure for Spatial Indicator extraction is 91.29 (XLNet); the highest average overall F1 measure across all four spatial roles is 92.9 using gold Indicators (XLNet) and 85.6 using predicted Indicators (BERT pre-trained on MIMIC notes).
The corpus is available on Mendeley at http://dx.doi.org/10.17632/yhb26hfz8n.1 and on GitHub at https://github.com/krobertslab/datasets/blob/master/Rad-SpRL.xml.
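As an illustration of the Rad-SpRL scheme described above, a single annotated frame and its conversion to BIO labels for sequence tagging might be sketched as follows. This is not code from the paper; the example sentence, span offsets, and role abbreviations are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    start: int  # inclusive token index
    end: int    # exclusive token index

@dataclass
class SpatialFrame:
    """One Rad-SpRL frame: every role is anchored to a Spatial Indicator."""
    indicator: Span                   # spatial preposition, e.g. "in"
    trajector: Optional[Span] = None  # finding located by the preposition
    landmark: Optional[Span] = None   # anatomical reference location
    diagnosis: Optional[Span] = None  # associated differential diagnosis
    hedge: Optional[Span] = None      # uncertainty phrase, if present

def span_text(tokens, span):
    """Recover the surface text of a role span."""
    return " ".join(tokens[span.start:span.end]) if span else ""

def to_bio(tokens, frame):
    """Encode a frame as per-token BIO labels, the usual input format
    for a Bi-LSTM-CRF (or BERT/XLNet) token classifier."""
    labels = ["O"] * len(tokens)
    for role, span in [("IND", frame.indicator), ("TRAJ", frame.trajector),
                       ("LAND", frame.landmark), ("DIAG", frame.diagnosis),
                       ("HEDGE", frame.hedge)]:
        if span is None:
            continue
        labels[span.start] = f"B-{role}"
        for i in range(span.start + 1, span.end):
            labels[i] = f"I-{role}"
    return labels

# Hypothetical report sentence (not drawn from the corpus):
tokens = "Patchy opacity in the left lower lobe may represent pneumonia".split()
frame = SpatialFrame(
    indicator=Span(2, 3),   # "in"
    trajector=Span(0, 2),   # "Patchy opacity"
    landmark=Span(3, 7),    # "the left lower lobe"
    hedge=Span(7, 9),       # "may represent"
    diagnosis=Span(9, 10),  # "pneumonia"
)
print(span_text(tokens, frame.landmark))  # -> the left lower lobe
print(to_bio(tokens, frame))
```

In the paper's two-stage setup, the model would first tag only the Indicator span, then tag the remaining four roles conditioned on each predicted Indicator.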