Yang Xu, Wang Hao, Xie De, Deng Cheng, Tao Dacheng
IEEE Trans Image Process. 2022;31:2839-2849. doi: 10.1109/TIP.2022.3161832. Epub 2022 Apr 5.
Video referring segmentation focuses on segmenting out the object in a video that corresponds to a given textual description. Previous works have primarily tackled this task by devising two crucial parts: an intra-modal module for context modeling and an inter-modal module for heterogeneous alignment. However, this approach has two key drawbacks: (1) it lacks joint learning of context modeling and heterogeneous alignment, leading to insufficient interactions among input elements; (2) both modules require task-specific expert knowledge to design, which severely limits the flexibility and generality of prior methods. To address these problems, we propose a novel Object-Agnostic Transformer-based Network, called OATNet, that simultaneously conducts intra-modal and inter-modal learning for video referring segmentation, without the aid of object detection or category-specific pixel labeling. More specifically, we first directly feed the sequence of textual tokens and visual tokens (pixels rather than detected object bounding boxes) into a multi-modal encoder, where context and alignment are explored simultaneously and effectively. We then design a novel cascade segmentation network that decouples the task into coarse-grained segmentation and fine-grained refinement. Moreover, to account for sample difficulty, we provide a more balanced metric that better diagnoses the performance of the proposed method. Extensive experiments on two popular datasets, A2D Sentences and J-HMDB Sentences, demonstrate that our proposed approach noticeably outperforms state-of-the-art methods.
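The following is a minimal sketch of the pipeline the abstract describes: textual tokens and per-pixel visual tokens are concatenated into one sequence and passed through a shared transformer encoder (so self-attention performs intra-modal context modeling and inter-modal alignment jointly), after which a coarse mask is predicted and then refined. All module names, feature sizes, and the residual refinement head here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; hyperparameters, the CNN feature dimension (2048),
# and the refinement head are assumptions made for this example.
import torch
import torch.nn as nn

class OATNetSketch(nn.Module):
    def __init__(self, d_model=256, vocab_size=10000, num_layers=4, nhead=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project per-pixel visual features (e.g., from a CNN backbone) to d_model.
        self.visual_proj = nn.Linear(2048, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        # One shared encoder: self-attention over the joint sequence covers both
        # intra-modal context and inter-modal alignment at the same time.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.coarse_head = nn.Linear(d_model, 1)  # coarse per-pixel logit
        # Hypothetical fine-grained refinement stage of the cascade.
        self.refine_head = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, text_ids, visual_feats):
        # text_ids: (B, T) token ids; visual_feats: (B, H, W, 2048) pixel features.
        B, H, W, _ = visual_feats.shape
        txt = self.text_embed(text_ids)                           # (B, T, d)
        vis = self.visual_proj(visual_feats.view(B, H * W, -1))   # (B, HW, d)
        tokens = torch.cat([txt, vis], dim=1)                     # joint sequence
        fused = self.encoder(tokens)                              # joint context + alignment
        pix = fused[:, txt.size(1):]                              # keep visual positions
        coarse = self.coarse_head(pix).view(B, 1, H, W)           # coarse-grained mask logits
        fine = coarse + self.refine_head(torch.sigmoid(coarse))   # residual refinement
        return torch.sigmoid(coarse), torch.sigmoid(fine)

# Usage with random inputs (batch of 2, 12-word query, 20x20 feature map):
# coarse, fine = OATNetSketch()(torch.randint(0, 10000, (2, 12)),
#                               torch.randn(2, 20, 20, 2048))
```

Treating pixels as tokens (rather than detected boxes) is what makes the design object-agnostic: no detector or category-specific labeling enters the pipeline, and a single attention mechanism replaces the hand-designed intra- and inter-modal modules of prior work.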