Cross-Modal Progressive Comprehension for Referring Segmentation.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):4761-4775. doi: 10.1109/TPAMI.2021.3079993. Epub 2022 Aug 4.

DOI: 10.1109/TPAMI.2021.3079993
PMID: 33983880
Abstract

Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, guided by the informative words in the expression: first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a cross-modal progressive comprehension (CMPC) scheme to effectively mimic this human behavior and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be referred to by the expression. Then, relational words are adopted to highlight the target entity and suppress the irrelevant ones via spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the correct entity matched with the action cues via temporal graph reasoning. In addition to CMPC, we also introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE yields our image and video referring segmentation frameworks, which achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively. Our code is available at https://github.com/spyflying/CMPC-Refseg.


Similar Articles

1. Cross-Modal Progressive Comprehension for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):4761-4775. doi: 10.1109/TPAMI.2021.3079993. Epub 2022 Aug 4.
2. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
3. Language-Aware Vision Transformer for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5238-5255. doi: 10.1109/TPAMI.2024.3468640.
4. Language-Aware Spatial-Temporal Collaboration for Referring Video Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8646-8659. doi: 10.1109/TPAMI.2023.3235720. Epub 2023 Jun 5.
5. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
6. Referring Segmentation via Encoder-Fused Cross-Modal Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7654-7667. doi: 10.1109/TPAMI.2022.3221387. Epub 2023 May 5.
7. Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension.
IEEE Trans Image Process. 2024;33:3256-3270. doi: 10.1109/TIP.2024.3394260. Epub 2024 May 8.
8. Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation.
IEEE Trans Neural Netw Learn Syst. 2023 May;34(5):2246-2258. doi: 10.1109/TNNLS.2021.3106153. Epub 2023 May 2.
9. Unambiguous Scene Text Segmentation with Referring Expression Comprehension.
IEEE Trans Image Process. 2019 Jul 26. doi: 10.1109/TIP.2019.2930176.
10. Video-Instrument Synergistic Network for Referring Video Instrument Segmentation in Robotic Surgery.
IEEE Trans Med Imaging. 2024 Dec;43(12):4457-4469. doi: 10.1109/TMI.2024.3426953. Epub 2024 Dec 2.

Cited By

1. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
2. Absolute and Relative Depth-Induced Network for RGB-D Salient Object Detection.
Sensors (Basel). 2023 Mar 30;23(7):3611. doi: 10.3390/s23073611.