Yang Xu, Wang Hao, Xie De, Deng Cheng, Tao Dacheng
IEEE Trans Image Process. 2022;31:2839-2849. doi: 10.1109/TIP.2022.3161832. Epub 2022 Apr 5.
Video referring segmentation focuses on segmenting out the object in a video that corresponds to a given textual description. Previous works have primarily tackled this task by devising two crucial parts: an intra-modal module for context modeling and an inter-modal module for heterogeneous alignment. However, this approach has two key drawbacks: (1) it lacks joint learning of context modeling and heterogeneous alignment, leading to insufficient interactions among input elements; (2) both modules require task-specific expert knowledge to design, which severely limits the flexibility and generality of prior methods. To address these problems, we propose a novel Object-Agnostic Transformer-based Network, called OATNet, that simultaneously conducts intra-modal and inter-modal learning for video referring segmentation, without the aid of object detection or category-specific pixel labeling. More specifically, we first directly feed the sequence of textual tokens and visual tokens (pixels rather than detected object bounding boxes) into a multi-modal encoder, where context and alignment are explored simultaneously and effectively. We then design a novel cascade segmentation network that decouples the task into coarse-grained segmentation and fine-grained refinement. Moreover, to account for sample difficulty, we provide a more balanced metric that better diagnoses the performance of the proposed method. Extensive experiments on two popular datasets, A2D Sentences and J-HMDB Sentences, demonstrate that our proposed approach noticeably outperforms state-of-the-art methods.
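The following is a minimal sketch of the pipeline the abstract describes: textual tokens and per-pixel visual tokens are concatenated into one sequence and passed through a shared transformer encoder (so self-attention performs intra-modal context modeling and inter-modal alignment jointly), after which a coarse mask is predicted and then refined. All module names, feature sizes, and the residual refinement head here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; hyperparameters, the CNN feature dimension (2048),
# and the refinement head are assumptions made for this example.
import torch
import torch.nn as nn

class OATNetSketch(nn.Module):
    def __init__(self, d_model=256, vocab_size=10000, num_layers=4, nhead=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project per-pixel visual features (e.g., from a CNN backbone) to d_model.
        self.visual_proj = nn.Linear(2048, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        # One shared encoder: self-attention over the joint sequence covers both
        # intra-modal context and inter-modal alignment at the same time.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.coarse_head = nn.Linear(d_model, 1)  # coarse per-pixel logit
        # Hypothetical fine-grained refinement stage of the cascade.
        self.refine_head = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, text_ids, visual_feats):
        # text_ids: (B, T) token ids; visual_feats: (B, H, W, 2048) pixel features.
        B, H, W, _ = visual_feats.shape
        txt = self.text_embed(text_ids)                           # (B, T, d)
        vis = self.visual_proj(visual_feats.view(B, H * W, -1))   # (B, HW, d)
        tokens = torch.cat([txt, vis], dim=1)                     # joint sequence
        fused = self.encoder(tokens)                              # joint context + alignment
        pix = fused[:, txt.size(1):]                              # keep visual positions
        coarse = self.coarse_head(pix).view(B, 1, H, W)           # coarse-grained mask logits
        fine = coarse + self.refine_head(torch.sigmoid(coarse))   # residual refinement
        return torch.sigmoid(coarse), torch.sigmoid(fine)

# Usage with random inputs (batch of 2, 12-word query, 20x20 feature map):
# coarse, fine = OATNetSketch()(torch.randint(0, 10000, (2, 12)),
#                               torch.randn(2, 20, 20, 2048))
```

Treating pixels as tokens (rather than detected boxes) is what makes the design object-agnostic: no detector or category-specific labeling enters the pipeline, and a single attention mechanism replaces the hand-designed intra- and inter-modal modules of prior work.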