
Mapping medical image-text to a joint space via masked modeling.

Affiliations

The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.

Sun Yat-sen University, Guangzhou, 510275, China.

Publication Information

Med Image Anal. 2024 Jan;91:103018. doi: 10.1016/j.media.2023.103018. Epub 2023 Nov 4.

Abstract

Recently, masked autoencoders have demonstrated their feasibility in extracting effective image and text features (e.g., BERT in natural language processing (NLP) and MAE in computer vision (CV)). This study investigates the potential of applying these techniques to vision-and-language representation learning in the medical domain. To this end, we introduce a self-supervised learning paradigm, the multi-modal masked autoencoder (M3AE), which learns to map medical images and texts to a joint space by reconstructing pixels and tokens from randomly masked images and texts. Specifically, we design this approach from three aspects: first, given the differing information densities of vision and language, we employ distinct masking ratios for input images and text, with a notably higher ratio for images; second, we use visual and textual features from different encoder layers for reconstruction, to account for the differing levels of abstraction in vision and language; third, we develop separate decoder designs for vision and language. We establish a medical vision-and-language benchmark to conduct an extensive evaluation. Our experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art results on all downstream tasks. Further analyses validate the effectiveness of the individual components and discuss the limitations of the proposed approach. The source code is available at https://github.com/zhjohnchan/M3AE.
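A minimal PyTorch sketch of the asymmetric masking step described in the abstract, for illustration only: the `random_mask` helper, the 75%/15% masking ratios, and the tensor shapes are assumptions chosen to match common MAE/BERT conventions, not values confirmed by the paper.

```python
import torch

IMAGE_MASK_RATIO = 0.75  # assumption: heavy masking for redundant pixels
TEXT_MASK_RATIO = 0.15   # assumption: light masking for information-dense text

def random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Keep a random subset of tokens; return the kept tokens, their
    indices, and a boolean mask (True = position was masked out)."""
    batch, seq_len, dim = tokens.shape
    num_keep = int(seq_len * (1.0 - mask_ratio))

    # Per-sample random permutation; the first num_keep positions survive.
    noise = torch.rand(batch, seq_len, device=tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]

    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    mask = torch.ones(batch, seq_len, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, ids_keep, False)
    return kept, ids_keep, mask

# Toy embeddings: 196 ViT patches per image, 40 report tokens per text.
image_patches = torch.randn(2, 196, 768)
text_tokens = torch.randn(2, 40, 768)

img_kept, _, img_mask = random_mask(image_patches, IMAGE_MASK_RATIO)
txt_kept, _, txt_mask = random_mask(text_tokens, TEXT_MASK_RATIO)
print(img_kept.shape, txt_kept.shape)  # (2, 49, 768) and (2, 34, 768)
```

The heavier image masking reflects the abstract's point about information density: pixels are far more redundant than words, so the visual encoder can still recover content from a small fraction of visible patches, while text tolerates only light masking.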

