
Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Authors

Wu Ao, Wang Rong, Tan Quange, Song Zhenfeng

Affiliations

School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China.

Key Laboratory of Security Prevention Technology and Risk Assessment of Ministry of Public Security, Beijing 100038, China.

Publication

Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.

DOI: 10.3390/s24165375
PMID: 39205068
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11359580/
Abstract

Referring video object segmentation (R-VOS) is a fundamental vision-language task which aims to segment the target referred by language expression in all video frames. Existing query-based R-VOS methods have conducted in-depth exploration of the interaction and alignment between visual and linguistic features but fail to transfer the information of the two modalities to the query vector with balanced intensities. Furthermore, most of the traditional approaches suffer from severe information loss in the process of multi-scale feature fusion, resulting in inaccurate segmentation. In this paper, we propose DCT, an end-to-end decoupled cross-modal transformer for referring video object segmentation, to better utilize multi-modal and multi-scale information. Specifically, we first design a Language-Guided Visual Enhancement Module (LGVE) to transmit discriminative linguistic information to visual features of all levels, performing an initial filtering of irrelevant background regions. Then, we propose a decoupled transformer decoder, using a set of object queries to gather entity-related information from both visual and linguistic features independently, mitigating the attention bias caused by feature size differences. Finally, the Cross-layer Feature Pyramid Network (CFPN) is introduced to preserve more visual details by establishing direct cross-layer communication. Extensive experiments have been carried out on A2D-Sentences, JHMDB-Sentences and Ref-Youtube-VOS. The results show that DCT achieves competitive segmentation accuracy compared with the state-of-the-art methods.
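The decoupled decoder described above lets each object query attend to visual features and linguistic features in two *separate* passes, so the much larger visual feature set cannot drown out the handful of word features. A minimal, framework-free sketch of that idea (not the paper's implementation; the function names and the simple averaging fusion are assumptions for illustration):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention of a single query over one feature set."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def decoupled_decode(query, visual_feats, lang_feats):
    """Gather entity-related information from each modality independently,
    then fuse, so attention weights are normalized per modality and the
    size imbalance between pixels and words cannot bias the query."""
    vis_ctx = attend(query, visual_feats, visual_feats)
    lang_ctx = attend(query, lang_feats, lang_feats)
    # Fusion by averaging is a placeholder; the paper's fusion differs.
    return [(v + l) / 2.0 for v, l in zip(vis_ctx, lang_ctx)]
```

Because each `attend` call runs its own softmax, the single linguistic feature above receives weight 1.0 in its pass instead of competing against every visual token in one joint attention map, which is the attention-bias problem the abstract refers to.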


Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd5c/11359580/dd963aa7ddaa/sensors-24-05375-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd5c/11359580/fdbfc0ecb449/sensors-24-05375-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd5c/11359580/e23853d7ef01/sensors-24-05375-g003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd5c/11359580/9ccd0dcd4e4e/sensors-24-05375-g004.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd5c/11359580/ddbed4de02a6/sensors-24-05375-g005.jpg

Similar Articles

1. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation. Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
2. Language-Aware Vision Transformer for Referring Segmentation. IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5238-5255. doi: 10.1109/TPAMI.2024.3468640.
3. Referring Segmentation via Encoder-Fused Cross-Modal Attention Network. IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7654-7667. doi: 10.1109/TPAMI.2022.3221387. Epub 2023 May 5.
4. Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation. IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.
5. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network. IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
6. Object-Agnostic Transformers for Video Referring Segmentation. IEEE Trans Image Process. 2022;31:2839-2849. doi: 10.1109/TIP.2022.3161832. Epub 2022 Apr 5.
7. Language-Aware Spatial-Temporal Collaboration for Referring Video Segmentation. IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8646-8659. doi: 10.1109/TPAMI.2023.3235720. Epub 2023 Jun 5.
8. A 3D hierarchical cross-modality interaction network using transformers and convolutions for brain glioma segmentation in MR images. Med Phys. 2024 Nov;51(11):8371-8389. doi: 10.1002/mp.17354. Epub 2024 Aug 13.
9. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation. IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.
10. SwinCross: Cross-modal Swin transformer for head-and-neck tumor segmentation in PET/CT images. Med Phys. 2024 Mar;51(3):2096-2107. doi: 10.1002/mp.16703. Epub 2023 Sep 30.

References Cited in This Article

1. Actor and Action Modular Network for Text-Based Video Segmentation. IEEE Trans Image Process. 2022;31:4474-4489. doi: 10.1109/TIP.2022.3185487. Epub 2022 Jul 1.
2. Cross-Modal Progressive Comprehension for Referring Segmentation. IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):4761-4775. doi: 10.1109/TPAMI.2021.3079993. Epub 2022 Aug 4.
3. Cross-Layer Feature Pyramid Network for Salient Object Detection. IEEE Trans Image Process. 2021;30:4587-4598. doi: 10.1109/TIP.2021.3072811. Epub 2021 Apr 29.
4. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network. IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.