Mapping medical image-text to a joint space via masked modeling.

Affiliations

The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.

Sun Yat-sen University, Guangzhou, 510275, China.

Publication information

Med Image Anal. 2024 Jan;91:103018. doi: 10.1016/j.media.2023.103018. Epub 2023 Nov 4.

DOI:10.1016/j.media.2023.103018

PMID:37976867

Abstract

Recently, masked autoencoders have demonstrated their feasibility in extracting effective image and text features (e.g., BERT for natural language processing (NLP) and MAE in computer vision (CV)). This study investigates the potential of applying these techniques to vision-and-language representation learning in the medical domain. To this end, we introduce a self-supervised learning paradigm, multi-modal masked autoencoders (M3AE). It learns to map medical images and texts to a joint space by reconstructing pixels and tokens from randomly masked images and texts. Specifically, we design this approach from three aspects: first, taking into account the varying information densities of vision and language, we employ distinct masking ratios for input images and text, with a notably higher masking ratio for images; second, we utilize visual and textual features from different layers for reconstruction to address varying levels of abstraction in vision and language; third, we develop different designs for vision and language decoders. We establish a medical vision-and-language benchmark to conduct an extensive evaluation. Our experimental results demonstrate the effectiveness of the proposed method, achieving state-of-the-art results on all downstream tasks. Further analyses validate the effectiveness of the various components and discuss the limitations of the proposed approach. The source code is available at https://github.com/zhjohnchan/M3AE.
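
The first design choice in the abstract, a higher masking ratio for images than for text, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the ratios, so 0.75 and 0.15 are assumed here from the MAE and BERT conventions, and the helper `random_mask`, the patch count, and the example sentence are hypothetical.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Randomly split indices 0..num_items-1 into (kept, masked) lists."""
    num_masked = int(round(num_items * mask_ratio))
    order = list(range(num_items))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


# Assumed ratios for illustration only; the abstract gives no numbers.
IMAGE_MASK_RATIO = 0.75  # MAE-style high ratio for image patches
TEXT_MASK_RATIO = 0.15   # BERT-style low ratio for text tokens

rng = random.Random(0)
num_patches = 196  # e.g. a 224x224 image cut into 16x16-pixel patches
tokens = "no focal consolidation pleural effusion or pneumothorax is seen".split()

kept_patches, masked_patches = random_mask(num_patches, IMAGE_MASK_RATIO, rng)
kept_tokens, masked_tokens = random_mask(len(tokens), TEXT_MASK_RATIO, rng)

# Replace masked tokens with a placeholder; a model would be trained to
# reconstruct these tokens and the pixels of the masked patches.
masked_text = ["[MASK]" if i in masked_tokens else t for i, t in enumerate(tokens)]

print(len(masked_patches), "of", num_patches, "patches masked")
print(" ".join(masked_text))
```

Under these assumed ratios, most of the image is hidden while the sentence loses only a token or so, which reflects the abstract's point that images carry more redundant information than text.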

Similar articles

Rethinking masked image modelling for medical image representation.

Med Image Anal. 2024 Dec;98:103304. doi: 10.1016/j.media.2024.103304. Epub 2024 Aug 17.

MCPL: Multi-Modal Collaborative Prompt Learning for Medical Vision-Language Model.

IEEE Trans Med Imaging. 2024 Dec;43(12):4224-4235. doi: 10.1109/TMI.2024.3418408. Epub 2024 Dec 2.

Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training.

IEEE J Biomed Health Inform. 2022 Dec;26(12):6070-6080. doi: 10.1109/JBHI.2022.3207502. Epub 2022 Dec 7.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.

MAE-TransRNet: An improved transformer-ConvNet architecture with masked autoencoder for cardiac MRI registration.

Front Med (Lausanne). 2023 Mar 9;10:1114571. doi: 10.3389/fmed.2023.1114571. eCollection 2023.

Benchmarking and Boosting Transformers for Medical Image Classification.

Domain Adapt Represent Transf (2022). 2022 Sep;13542:12-22. doi: 10.1007/978-3-031-16852-9_2. Epub 2022 Sep 15.

Hybrid Masked Image Modeling for 3D Medical Image Segmentation.

IEEE J Biomed Health Inform. 2024 Apr;28(4):2115-2125. doi: 10.1109/JBHI.2024.3360239. Epub 2024 Apr 4.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Multi-Scale Hybrid Fusion Network for Single Image Deraining.

IEEE Trans Neural Netw Learn Syst. 2023 Jul;34(7):3594-3608. doi: 10.1109/TNNLS.2021.3112235. Epub 2023 Jul 6.

Cited by

A Systematic Review and Implementation Guidelines of Multimodal Foundation Models in Medical Imaging.

Res Sq. 2025 Apr 28:rs.3.rs-5537908. doi: 10.21203/rs.3.rs-5537908/v1.
