Liao Ruizhi, Moyer Daniel, Cha Miriam, Quigley Keegan, Berkowitz Seth, Horng Steven, Golland Polina, Wells William M
CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA.
MIT Lincoln Laboratory, Lexington, MA, USA.
Med Image Comput Comput Assist Interv. 2021 Sep-Oct;12902:273-283. doi: 10.1007/978-3-030-87196-3_26. Epub 2021 Sep 21.
We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal of this approach is to learn image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning. Our code is available at: https://github.com/RayRuizhiLiao/mutual_info_img_txt.
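To illustrate the kind of estimator the abstract refers to, the sketch below computes an InfoNCE-style lower bound on the mutual information between local image features and a per-image text embedding, using a simple dot-product critic. This is a hypothetical simplification for intuition only, not the paper's actual architecture or loss: the function name, the dot-product critic, and the feature shapes are all assumptions.

```python
import numpy as np

def local_infonce_bound(img_feats, txt_feats):
    """InfoNCE lower bound on local image-text mutual information.

    img_feats: (B, L, D) array -- B images, L local regions, D-dim features
    txt_feats: (B, D) array    -- one text embedding per image

    For each local image feature, the text from the same image is the
    positive; texts from the other images in the batch are negatives.
    Returns log(B) + mean log-softmax score of the positives, which
    lower-bounds the mutual information under the InfoNCE framework.
    """
    B, L, D = img_feats.shape
    # Dot-product critic: score every (local image feature, text) pair.
    scores = np.einsum('bld,cd->blc', img_feats, txt_feats)  # (B, L, B)
    # Numerically stable log-sum-exp over the B candidate texts.
    m = scores.max(axis=-1, keepdims=True)
    log_norm = m.squeeze(-1) + np.log(np.exp(scores - m).sum(axis=-1))
    # Positive scores sit on the "diagonal": text b for image b.
    pos = scores[np.arange(B)[:, None], np.arange(L)[None, :],
                 np.arange(B)[:, None]]                      # (B, L)
    return np.log(B) + (pos - log_norm).mean()
```

Maximizing this quantity over the parameters of the image and text encoders (here the features are given as fixed arrays) pushes matched local features and text embeddings to score higher than mismatched ones; the bound can never exceed log(B), which is why larger batches permit tighter estimates.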