


VLT: Vision-Language Transformer and Query Generation for Referring Segmentation.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.

DOI:10.1109/TPAMI.2022.3217852
PMID:36306296
Abstract

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding to vision-language features. There are different ways to understand the dynamic emphasis of a language expression, especially when interacting with the image. However, the learned queries in existing transformer works are fixed after training, which cannot cope with the randomness and huge diversity of the language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of language expression. To find the best among these diverse comprehensions, so as to generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries. Furthermore, to enhance the model's ability in dealing with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of understanding different language expressions to the same object. We introduce masked contrastive learning to narrow down the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
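The two mechanisms the abstract describes — a Query Generation Module that derives multiple input-specific query sets from the language expression conditioned on the image, and a Query Balance Module that selectively fuses the responses of those queries — can be sketched roughly as follows. This is a minimal numpy illustration of the idea only; all shapes, the projection tensor `W`, and the scoring vector are hypothetical assumptions, not the paper's actual transformer-based implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_vis, n_words, n_queries = 8, 16, 5, 3  # illustrative sizes

vision = rng.standard_normal((n_vis, d))      # flattened visual features
language = rng.standard_normal((n_words, d))  # word-level language features

# Query Generation Module (sketch): condition each word feature on the
# image via attention, then project into n_queries distinct query sets,
# so every query represents one "comprehension" of the expression.
attn = softmax(language @ vision.T / np.sqrt(d), axis=-1)  # (n_words, n_vis)
vis_ctx = attn @ vision                                    # (n_words, d)
W = rng.standard_normal((n_queries, d, d)) * 0.1           # hypothetical projections
queries = np.stack([softmax((language + vis_ctx) @ W[i], axis=-1).mean(axis=0)
                    for i in range(n_queries)])            # (n_queries, d)

# Each query attends to the visual features to produce its response.
responses = softmax(queries @ vision.T / np.sqrt(d), axis=-1) @ vision  # (n_queries, d)

# Query Balance Module (sketch): score each query and fuse the
# responses selectively, favoring the best comprehension.
balance = softmax(queries @ rng.standard_normal(d))  # (n_queries,) fusion weights
fused = balance @ responses                          # (d,) feature used for mask decoding
```

Because `balance` is a softmax, the fusion weights sum to one, so the module interpolates among the query-specific responses rather than committing to a single fixed query — which is the property the paper uses to cope with the diversity of language expressions.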


Similar Articles

1
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.
2
Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation.
IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.
3
Language-Aware Vision Transformer for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5238-5255. doi: 10.1109/TPAMI.2024.3468640.
4
Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
5
Automated multi-modal Transformer network (AMTNet) for 3D medical images segmentation.
Phys Med Biol. 2023 Jan 9;68(2). doi: 10.1088/1361-6560/aca74c.
6
Enhancing Query Formulation for Universal Image Segmentation.
Sensors (Basel). 2024 Mar 14;24(6):1879. doi: 10.3390/s24061879.
7
Coarse Mask Guided Interactive Object Segmentation.
IEEE Trans Image Process. 2023;32:5808-5822. doi: 10.1109/TIP.2023.3322564. Epub 2023 Oct 26.
8
Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
9
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.
10
Toward Robust Referring Image Segmentation.
IEEE Trans Image Process. 2024;33:1782-1794. doi: 10.1109/TIP.2024.3371348. Epub 2024 Mar 8.

Cited By

1
SMF-net: semantic-guided multimodal fusion network for precise pancreatic tumor segmentation in medical CT image.
Front Oncol. 2025 Jul 18;15:1622426. doi: 10.3389/fonc.2025.1622426. eCollection 2025.
2
Large language model-augmented learning for auto-delineation of treatment targets in head-and-neck cancer radiotherapy.
Radiother Oncol. 2025 Apr;205:110740. doi: 10.1016/j.radonc.2025.110740. Epub 2025 Jan 22.
3
Large Language Model-Augmented Auto-Delineation of Treatment Target Volume in Radiation Therapy.
ArXiv. 2024 Jul 10:arXiv:2407.07296v1.