IEEE Trans Image Process. 2024;33:3256-3270. doi: 10.1109/TIP.2024.3394260. Epub 2024 May 8.
Video-based referring expression comprehension is a challenging task that requires locating the referred object in every frame of a given video. Many existing approaches treat this task as object tracking, but their performance depends heavily on the quality of the tracking templates, and tracking may fail when there is too little annotated data to guide template selection. Other approaches are based on object detection, but they often use only a single frame adjacent to the key frame for feature learning, which limits their ability to model relationships across frames. Moreover, how to better fuse features from multiple frames with the referring expression to localize the referent effectively remains an open problem. To address these issues, we propose a novel approach, the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), built on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning over temporally adjacent frames. As the main body of the multi-stage learning, we further propose an Image-Language Cross-Generative Fusion module that generates cross-modal features by computing the similarity between the video and the expression, then refines and fuses the generated features. To further strengthen the model's cross-modal feature generation, we introduce a consistency loss that constrains the image-language and language-image similarity matrices during feature generation. We evaluate the proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results.
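To make the two cross-modal components concrete, the following is a minimal PyTorch sketch of one cross-generative fusion stage and the consistency constraint, assuming visual and language token features projected into a shared dimension. The names `CrossGenerativeFusion` and `consistency_loss`, the linear projection and refinement layers, the residual fusion, and the MSE form of the loss are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGenerativeFusion(nn.Module):
    """Hypothetical sketch of one Image-Language Cross-Generative Fusion
    stage. Dimensions, projections, and the residual fusion are assumed."""

    def __init__(self, dim=256):
        super().__init__()
        self.scale = dim ** -0.5
        # Separate projections per direction, so the image-language and
        # language-image similarity matrices are not trivially transposes.
        self.v_proj = nn.Linear(dim, dim)
        self.l_proj = nn.Linear(dim, dim)
        # Stand-ins for the refinement applied to the generated features.
        self.refine_v = nn.Linear(dim, dim)
        self.refine_l = nn.Linear(dim, dim)

    def forward(self, v, l):
        # v: (B, N, dim) visual tokens from the aggregated frame features
        # l: (B, T, dim) word tokens of the referring expression
        sim_vl = torch.matmul(self.v_proj(v), l.transpose(1, 2)) * self.scale  # (B, N, T)
        sim_lv = torch.matmul(self.l_proj(l), v.transpose(1, 2)) * self.scale  # (B, T, N)

        # Cross-generate: language-conditioned visual features and
        # vision-conditioned language features from the similarities.
        gen_v = torch.matmul(sim_vl.softmax(dim=-1), l)  # (B, N, dim)
        gen_l = torch.matmul(sim_lv.softmax(dim=-1), v)  # (B, T, dim)

        # Refine the generated features and fuse them with the inputs
        # (residual fusion is an assumption).
        return v + self.refine_v(gen_v), l + self.refine_l(gen_l), sim_vl, sim_lv


def consistency_loss(sim_vl, sim_lv):
    # Assumed form of the consistency loss: the image-language matrix and
    # the transposed language-image matrix should describe the same
    # cross-modal affinities.
    return F.mse_loss(sim_vl, sim_lv.transpose(1, 2))


# Usage with random features standing in for backbone outputs.
fusion = CrossGenerativeFusion(dim=256)
v, l = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
v2, l2, s_vl, s_lv = fusion(v, l)
loss_c = consistency_loss(s_vl, s_lv)
```

Stacking several such stages, each consuming the previous stage's fused outputs, gives one plausible reading of the "multi-stage" design, with the consistency loss accumulated across stages.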