Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension.

Publication Information

IEEE Trans Image Process. 2024;33:3256-3270. doi: 10.1109/TIP.2024.3394260. Epub 2024 May 8.

Abstract

Video-based referring expression comprehension is a challenging task that requires locating the referred object in each frame of a given video. Many existing approaches treat this task as an object-tracking problem, but their performance relies heavily on the quality of the tracking templates, and when there is not enough annotated data to assist template selection, tracking may fail. Other approaches are based on object detection, but they often use only a single frame adjacent to the key frame for feature learning, which limits their ability to establish relationships between different frames. In addition, how to better fuse features from multiple frames and referring expressions to locate the referents effectively remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning over adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning: it generates cross-modal features by computing the similarity between the video and the expression, and then refines and fuses the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language and language-image similarity matrices during feature generation. We evaluate the proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results.
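The abstract's central mechanism, generating cross-modal features from image-language and language-image similarity matrices and constraining the two matrices to stay consistent, can be illustrated with a short sketch. The code below is a hypothetical reading of that idea, not the authors' implementation: the tensor shapes, linear projections, residual fusion, and the MSE form of the consistency loss are all assumptions, and every name (CrossGenerativeFusion, proj_v, loss_consist, etc.) is invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGenerativeFusion(nn.Module):
    """One assumed stage of image-language cross-generative fusion."""

    def __init__(self, vis_dim: int, lang_dim: int, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, dim)   # project visual tokens
        self.proj_l = nn.Linear(lang_dim, dim)  # project language tokens

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor):
        # vis_feat: (B, Nv, vis_dim) flattened multi-frame visual features
        # lang_feat: (B, Nl, lang_dim) referring-expression token features
        v = self.proj_v(vis_feat)                 # (B, Nv, D)
        l = self.proj_l(lang_feat)                # (B, Nl, D)

        # Image-language and language-image similarity matrices.
        sim_il = torch.bmm(v, l.transpose(1, 2))  # (B, Nv, Nl)
        sim_li = torch.bmm(l, v.transpose(1, 2))  # (B, Nl, Nv)

        # Cross-generation: each modality's features are regenerated from
        # the other modality, weighted by the similarity matrix.
        gen_v = torch.bmm(F.softmax(sim_il, dim=-1), l)  # language-aware visual features
        gen_l = torch.bmm(F.softmax(sim_li, dim=-1), v)  # vision-aware language features

        # Consistency loss: one plausible form of the constraint is that the
        # two similarity matrices agree up to transposition.
        loss_consist = F.mse_loss(sim_il, sim_li.transpose(1, 2))

        # Simple residual fusion of the generated features into the visual
        # stream; the paper's refine-and-fuse step is presumably more elaborate.
        fused = v + gen_v
        return fused, gen_l, loss_consist
```

In a full pipeline one would expect the fused visual features to feed a one-stage detection head per frame, loss_consist to be added to the detection objective, and several such stages to be stacked to realize the multi-stage learning the abstract describes; none of those details are specified here.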
