Wu Ao, Wang Rong, Tan Quange, Song Zhenfeng
School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China.
Key Laboratory of Security Prevention Technology and Risk Assessment of Ministry of Public Security, Beijing 100038, China.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
Referring video object segmentation (R-VOS) is a fundamental vision-language task that aims to segment, in every video frame, the target referred to by a language expression. Existing query-based R-VOS methods have explored the interaction and alignment between visual and linguistic features in depth, but they fail to transfer information from the two modalities to the query vectors with balanced intensity. Furthermore, most traditional approaches suffer severe information loss during multi-scale feature fusion, resulting in inaccurate segmentation. In this paper, we propose DCT, an end-to-end decoupled cross-modal transformer for referring video object segmentation, to better exploit multi-modal and multi-scale information. Specifically, we first design a Language-Guided Visual Enhancement Module (LGVE) that transmits discriminative linguistic information to visual features at all levels, performing an initial filtering of irrelevant background regions. Then, we propose a decoupled transformer decoder that uses a set of object queries to gather entity-related information from the visual and linguistic features independently, mitigating the attention bias caused by the difference in feature sizes. Finally, a Cross-layer Feature Pyramid Network (CFPN) is introduced to preserve more visual detail by establishing direct cross-layer communication. Extensive experiments are carried out on A2D-Sentences, JHMDB-Sentences, and Ref-Youtube-VOS. The results show that DCT achieves competitive segmentation accuracy compared with state-of-the-art methods.
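The paper does not include code, but the decoupled decoder idea can be illustrated with a minimal sketch: object queries attend to visual and linguistic tokens through two separate cross-attention branches rather than one shared attention over the concatenated modalities, so the much larger visual token set cannot dominate the attention distribution. The class name, layer dimensions, and the simple residual-sum combination of the two branches below are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn


class DecoupledDecoderLayer(nn.Module):
    """Hypothetical sketch of a decoupled cross-modal decoder layer.

    Object queries read from the visual and linguistic memories in two
    independent cross-attention branches, which is the balancing idea
    described in the abstract (exact fusion details are assumed here).
    """

    def __init__(self, d_model=256, nhead=8, dim_ffn=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ffn), nn.ReLU(), nn.Linear(dim_ffn, d_model)
        )
        self.norm = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, vis_tokens, txt_tokens):
        # queries:    (B, Nq, C) learnable object queries
        # vis_tokens: (B, Nv, C) flattened multi-scale visual features
        # txt_tokens: (B, Nt, C) word-level linguistic features
        q = self.norm[0](queries + self.self_attn(queries, queries, queries)[0])
        # Decoupled cross-attention: each modality is queried independently,
        # so the ratio Nv >> Nt does not bias the attention weights.
        v = self.vis_attn(q, vis_tokens, vis_tokens)[0]
        t = self.txt_attn(q, txt_tokens, txt_tokens)[0]
        q = self.norm[1](q + v)
        q = self.norm[2](q + t)
        return self.norm[3](q + self.ffn(q))


# Usage: 5 object queries read from 1960 visual tokens and 12 word tokens.
layer = DecoupledDecoderLayer()
out = layer(
    torch.randn(2, 5, 256), torch.randn(2, 1960, 256), torch.randn(2, 12, 256)
)
print(out.shape)  # torch.Size([2, 5, 256])
```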