Ji Zhanghexuan, Shaikh Mohammad Abuzar, Moukheiber Dana, Srihari Sargur N, Peng Yifan, Gao Mingchen
Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA.
Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
Mach Learn Med Imaging. 2021 Sep;12966:110-119. doi: 10.1007/978-3-030-87589-3_12. Epub 2021 Sep 21.
Self-supervised learning provides an opportunity to explore unlabeled chest X-rays and their associated free-text reports, accumulated in clinical routine, without manual supervision. This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports. The model is pre-trained for visual-textual matching at both the global image-sentence level and the local image region-word level. Both levels are bidirectionally constrained by cross-entropy-based and ranking-based triplet matching losses. Region-word matching is computed with an attention mechanism, without direct supervision of the region-to-word mapping. The pre-trained multi-modal representations pave the way for downstream tasks involving image and/or text encoding. We demonstrate the quality of the learned representations through cross-modality retrieval and multi-label classification on two datasets: OpenI-IU and MIMIC-CXR. Our code is available at https://github.com/mshaikh2/JoImTeR_MLMI_2021.
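The bidirectional, ranking-based triplet matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the cosine-similarity choice, the in-batch negative sampling, and the margin value are all illustrative assumptions; the idea shown is only that each matched image-report pair should score higher than mismatched pairs by a margin, enforced in both the image-to-text and text-to-image directions.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet matching loss, symmetric in both directions.

    img_emb, txt_emb: (B, D) arrays where row i of each is a matched
    image-report pair. Every other row j != i in the batch serves as a
    negative. The loss pushes each matched similarity above every
    mismatched one by at least `margin` (an illustrative value).
    """
    sim = cosine_sim(img_emb, txt_emb)   # (B, B) similarity matrix
    pos = np.diag(sim)                   # matched-pair similarities
    # image -> text: each image's match must beat the other reports
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # text -> image: each report's match must beat the other images
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    batch_size = sim.shape[0]
    mask = 1.0 - np.eye(batch_size)      # exclude the positive pair itself
    return ((cost_i2t + cost_t2i) * mask).sum() / batch_size
```

When the two encoders align matched pairs well (high diagonal similarity, low off-diagonal similarity), the hinge terms vanish and the loss goes to zero; misaligned pairs incur a positive penalty in both directions.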