Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension.

Publication Information

IEEE Trans Image Process. 2024;33:3256-3270. doi: 10.1109/TIP.2024.3394260. Epub 2024 May 8.

Abstract

Video-based referring expression comprehension is a challenging task that requires locating the referred object in each frame of a given video. Many existing approaches treat this task as an object-tracking problem, but their performance relies heavily on the quality of the tracking templates, and when there is not enough annotated data to assist template selection, tracking may fail. Other approaches are based on object detection, but they often use only a single frame adjacent to the key frame for feature learning, which limits their ability to establish relationships between different frames. In addition, how to better fuse features from multiple frames and referring expressions to locate the referents effectively remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning over adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning: it generates cross-modal features by computing the similarity between the video and the expression, and then refines and fuses the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language and language-image similarity matrices during feature generation. We evaluate the proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results.
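The abstract's central mechanism, generating cross-modal features from image-language and language-image similarity matrices and constraining the two matrices to stay consistent, can be illustrated with a short sketch. The code below is a hypothetical reading of that idea, not the authors' implementation: the tensor shapes, linear projections, residual fusion, and the MSE form of the consistency loss are all assumptions, and every name (CrossGenerativeFusion, proj_v, loss_consist, etc.) is invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGenerativeFusion(nn.Module):
    """One assumed stage of image-language cross-generative fusion."""

    def __init__(self, vis_dim: int, lang_dim: int, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, dim)   # project visual tokens
        self.proj_l = nn.Linear(lang_dim, dim)  # project language tokens

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor):
        # vis_feat: (B, Nv, vis_dim) flattened multi-frame visual features
        # lang_feat: (B, Nl, lang_dim) referring-expression token features
        v = self.proj_v(vis_feat)                 # (B, Nv, D)
        l = self.proj_l(lang_feat)                # (B, Nl, D)

        # Image-language and language-image similarity matrices.
        sim_il = torch.bmm(v, l.transpose(1, 2))  # (B, Nv, Nl)
        sim_li = torch.bmm(l, v.transpose(1, 2))  # (B, Nl, Nv)

        # Cross-generation: each modality's features are regenerated from
        # the other modality, weighted by the similarity matrix.
        gen_v = torch.bmm(F.softmax(sim_il, dim=-1), l)  # language-aware visual features
        gen_l = torch.bmm(F.softmax(sim_li, dim=-1), v)  # vision-aware language features

        # Consistency loss: one plausible form of the constraint is that the
        # two similarity matrices agree up to transposition.
        loss_consist = F.mse_loss(sim_il, sim_li.transpose(1, 2))

        # Simple residual fusion of the generated features into the visual
        # stream; the paper's refine-and-fuse step is presumably more elaborate.
        fused = v + gen_v
        return fused, gen_l, loss_consist
```

In a full pipeline one would expect the fused visual features to feed a one-stage detection head per frame, loss_consist to be added to the detection objective, and several such stages to be stacked to realize the multi-stage learning the abstract describes; none of those details are specified here.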
