Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation.

Publication Information

IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.

DOI: 10.1109/TIP.2023.3277791
PMID: 37220044
Abstract

We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which limits the model to comprehensively understand the multi-modal information, and brings uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention (M3Att) and Multi-Modal Mutual Decoder (MDec) that better fuse information from the two input modalities. Based on MDec, we further propose Iterative Multi-modal Interaction (IMI) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction (LFR) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and outperforms state-of-the-art referring image segmentation methods on RefCOCO series datasets consistently.
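
The abstract's key architectural point is that generic cross-attention uses the language input only to compute attention weights, so its output is a weighted sum of visual values alone and carries no explicit language content. The following is a minimal, hypothetical PyTorch sketch of that contrast; the module name, the concatenate-and-project fusion, and all shapes are illustrative assumptions, not the authors' M3Att/MDec implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualAttentionSketch(nn.Module):
    """Toy fusion block: unlike vanilla cross-attention, the output explicitly
    mixes language values with the attended visual values (assumed form)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_lang = nn.Linear(dim, dim)  # queries from language tokens
        self.k_vis = nn.Linear(dim, dim)   # keys from visual tokens
        self.v_vis = nn.Linear(dim, dim)   # values from visual tokens
        self.v_lang = nn.Linear(dim, dim)  # values from language tokens (absent in vanilla attention)
        self.proj = nn.Linear(2 * dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, N_v, C) flattened image features; lang: (B, N_l, C) word features
        attn = F.softmax(
            self.q_lang(lang) @ self.k_vis(vis).transpose(-2, -1) * self.scale, dim=-1
        )  # (B, N_l, N_v): so far, language is used only to form the weights

        attended_vis = attn @ self.v_vis(vis)  # vanilla output: aggregated visual values only
        # "Mutual" step (assumed fusion rule): concatenate language values so the
        # output explicitly carries both modalities before reaching a mask decoder.
        return self.proj(torch.cat([attended_vis, self.v_lang(lang)], dim=-1))


if __name__ == "__main__":
    block = MutualAttentionSketch(dim=256)
    out = block(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
    print(out.shape)  # torch.Size([2, 20, 256])
```

On this reading, IMI would amount to stacking such interaction blocks so language and vision features are refined over several rounds, and LFR would add a reconstruction objective that keeps the language content from washing out; both readings are paraphrases of the abstract, not implementation details.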


Similar Articles

1. Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation.
IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.
2. Language-Aware Vision Transformer for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5238-5255. doi: 10.1109/TPAMI.2024.3468640.
3. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.
4. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
5. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
6. Referring Segmentation via Encoder-Fused Cross-Modal Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7654-7667. doi: 10.1109/TPAMI.2022.3221387. Epub 2023 May 5.
7. Automated multi-modal Transformer network (AMTNet) for 3D medical images segmentation.
Phys Med Biol. 2023 Jan 9;68(2). doi: 10.1088/1361-6560/aca74c.
8. Self-Paced Multi-Grained Cross-Modal Interaction Modeling for Referring Expression Comprehension.
IEEE Trans Image Process. 2024;33:1497-1507. doi: 10.1109/TIP.2023.3334099. Epub 2024 Feb 21.
9. What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.
10. Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation.
IEEE Trans Neural Netw Learn Syst. 2023 May;34(5):2246-2258. doi: 10.1109/TNNLS.2021.3106153. Epub 2023 May 2.