Denis Jered McInerney, Geoffrey Young, Jan-Willem van de Meent, Byron C. Wallace
Northeastern University.
Brigham and Women's Hospital.
Proc Conf Empir Methods Nat Lang Process. 2022 Dec;2022:3626–3648.
Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHRs with human annotations that link image regions to sentences. Our main finding is that text often has a weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications, such as substituting "left" for "right", do not substantially change the highlighted regions. Simple techniques, such as allowing the model to opt out of attending to the image and few-shot finetuning, show promise for improving alignments with very little or no supervision. We make our code and checkpoints open-source.
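The alignment evaluation described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: soft local alignment is modeled as a per-token softmax over image-region similarities, a sentence-level heatmap as the average over token distributions, and the "left" → "right" perturbation test as a total-variation distance between the two resulting heatmaps (near zero means the text edit barely moved the highlights). All embeddings below are random stand-ins for encoder outputs; the function names and shapes are assumptions for illustration.

```python
import numpy as np

def soft_alignment(token_emb, region_emb, temperature=0.1):
    """Soft local alignment: for each text token, a distribution over
    image regions (softmax of scaled cosine similarities).
    Shapes: token_emb (T, d), region_emb (R, d); returns (T, R)."""
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    scores = t @ r.T / temperature                 # (T, R) similarity logits
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    return attn / attn.sum(axis=1, keepdims=True)

def sentence_heatmap(attn):
    """Collapse token-level attention (T, R) to one sentence-level heatmap (R,)."""
    return attn.mean(axis=0)

def perturbation_shift(heat_a, heat_b):
    """Total-variation distance between two heatmaps; near 0 means the
    text edit (e.g. 'left' -> 'right') barely influenced the highlights."""
    return 0.5 * np.abs(heat_a - heat_b).sum()

# Hypothetical stand-ins for encoder outputs (8 tokens, 49 regions, 64 dims).
rng = np.random.default_rng(0)
tokens_left = rng.normal(size=(8, 64))                 # embeds "... left lung ..."
tokens_right = tokens_left.copy()
tokens_right[3] += rng.normal(scale=0.05, size=64)     # small shift at the swapped word
regions = rng.normal(size=(49, 64))                    # e.g. a 7x7 grid of image patches

heat_l = sentence_heatmap(soft_alignment(tokens_left, regions))
heat_r = sentence_heatmap(soft_alignment(tokens_right, regions))
print(f"heatmap shift after left->right swap: {perturbation_shift(heat_l, heat_r):.4f}")
```

In this sketch, a heatmap shift close to zero after the swap would mirror the paper's finding that such synthetic edits leave the model's highlighted regions largely unchanged.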