
Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.

Authors

Ye Linwei, Rochan Mrigank, Liu Zhi, Zhang Xiaoqin, Wang Yang

Publication

IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.

Abstract

We consider the problem of referring segmentation in images and videos with natural language. Given an input image (or video) and a referring expression, the goal is to segment the entity referred to by the expression in the image or video. In this paper, we propose a cross-modal self-attention (CMSA) module that exploits fine details of individual words and of the input image or video, effectively capturing long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and on important regions in the visual input. We further propose a gated multi-level fusion (GMLF) module to selectively integrate self-attentive cross-modal features corresponding to different levels of visual features. This module controls the flow of information during feature fusion, combining high-level and low-level semantic information associated with different attended words. In addition, we introduce a cross-frame self-attention (CFSA) module that integrates temporal information across consecutive frames, extending our method to referring segmentation in videos. Experiments on four referring image segmentation benchmarks and two actor-and-action video segmentation benchmarks consistently demonstrate that our proposed approach outperforms existing state-of-the-art methods.
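The core idea of cross-modal self-attention can be illustrated with a minimal NumPy sketch: every (spatial position, word) pair is given a joint feature by concatenating the visual feature with the word embedding, and standard self-attention is then computed over all such pairs, so any region can attend to any word and vice versa. This is only a schematic under assumed shapes; the random projection matrices stand in for the learned weights of the paper's actual module, and the function name and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_self_attention(visual, words, d_k=32, rng=None):
    """Schematic CMSA-style block (not the authors' exact formulation).

    visual: (N, Dv) flattened visual feature map (N = H*W positions)
    words:  (T, Dw) word embeddings of the referring expression
    Returns attended features (N*T, d_k) and the attention map (N*T, N*T).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    N, Dv = visual.shape
    T, Dw = words.shape
    # Joint multimodal feature for every (position, word) pair.
    joint = np.concatenate(
        [np.repeat(visual, T, axis=0),   # (N*T, Dv): each position repeated per word
         np.tile(words, (N, 1))],        # (N*T, Dw): word list tiled per position
        axis=1)                          # (N*T, Dv+Dw)
    D = Dv + Dw
    # Random projections stand in for learned query/key/value weights.
    Wq, Wk, Wv = (rng.standard_normal((D, d_k)) / np.sqrt(D) for _ in range(3))
    q, k, v = joint @ Wq, joint @ Wk, joint @ Wv
    # Self-attention over all position-word pairs: every element attends to
    # every other, capturing long-range cross-modal dependencies.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return attn @ v, attn
```

In the full model this attention output would be reshaped back to the spatial grid and fed, per feature level, into the gated multi-level fusion module; here the sketch only shows where the linguistic and visual features interact.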

