


VLT: Vision-Language Transformer and Query Generation for Referring Segmentation.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.

DOI:10.1109/TPAMI.2022.3217852
PMID:36306296
Abstract

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding to vision-language features. There are different ways to understand the dynamic emphasis of a language expression, especially when interacting with the image. However, the learned queries in existing transformer works are fixed after training, which cannot cope with the randomness and huge diversity of the language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent the diverse comprehensions of language expression. To find the best among these diverse comprehensions, so as to generate a better mask, we propose a Query Balance Module to selectively fuse the corresponding responses of the set of queries. Furthermore, to enhance the model's ability in dealing with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of understanding different language expressions to the same object. We introduce masked contrastive learning to narrow down the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
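The two mechanisms the abstract describes — a Query Generation Module that derives multiple input-specific query sets from the language expression conditioned on the image, and a Query Balance Module that selectively fuses the responses of those queries — can be sketched roughly as follows. This is a minimal numpy illustration of the idea only; all shapes, the projection tensor `W`, and the scoring vector are hypothetical assumptions, not the paper's actual transformer-based implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_vis, n_words, n_queries = 8, 16, 5, 3  # illustrative sizes

vision = rng.standard_normal((n_vis, d))      # flattened visual features
language = rng.standard_normal((n_words, d))  # word-level language features

# Query Generation Module (sketch): condition each word feature on the
# image via attention, then project into n_queries distinct query sets,
# so every query represents one "comprehension" of the expression.
attn = softmax(language @ vision.T / np.sqrt(d), axis=-1)  # (n_words, n_vis)
vis_ctx = attn @ vision                                    # (n_words, d)
W = rng.standard_normal((n_queries, d, d)) * 0.1           # hypothetical projections
queries = np.stack([softmax((language + vis_ctx) @ W[i], axis=-1).mean(axis=0)
                    for i in range(n_queries)])            # (n_queries, d)

# Each query attends to the visual features to produce its response.
responses = softmax(queries @ vision.T / np.sqrt(d), axis=-1) @ vision  # (n_queries, d)

# Query Balance Module (sketch): score each query and fuse the
# responses selectively, favoring the best comprehension.
balance = softmax(queries @ rng.standard_normal(d))  # (n_queries,) fusion weights
fused = balance @ responses                          # (d,) feature used for mask decoding
```

Because `balance` is a softmax, the fusion weights sum to one, so the module interpolates among the query-specific responses rather than committing to a single fixed query — which is the property the paper uses to cope with the diversity of language expressions.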


Similar Articles

1
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7900-7916. doi: 10.1109/TPAMI.2022.3217852. Epub 2023 May 5.
2
Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation.
IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.
3
Language-Aware Vision Transformer for Referring Segmentation.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5238-5255. doi: 10.1109/TPAMI.2024.3468640.
4
Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
5
Automated multi-modal Transformer network (AMTNet) for 3D medical images segmentation.
Phys Med Biol. 2023 Jan 9;68(2). doi: 10.1088/1361-6560/aca74c.
6
Enhancing Query Formulation for Universal Image Segmentation.
Sensors (Basel). 2024 Mar 14;24(6):1879. doi: 10.3390/s24061879.
7
Coarse Mask Guided Interactive Object Segmentation.
IEEE Trans Image Process. 2023;32:5808-5822. doi: 10.1109/TIP.2023.3322564. Epub 2023 Oct 26.
8
Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
9
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.
10
Toward Robust Referring Image Segmentation.
IEEE Trans Image Process. 2024;33:1782-1794. doi: 10.1109/TIP.2024.3371348. Epub 2024 Mar 8.

Cited By

1
SMF-net: semantic-guided multimodal fusion network for precise pancreatic tumor segmentation in medical CT image.
Front Oncol. 2025 Jul 18;15:1622426. doi: 10.3389/fonc.2025.1622426. eCollection 2025.
2
Large language model-augmented learning for auto-delineation of treatment targets in head-and-neck cancer radiotherapy.
Radiother Oncol. 2025 Apr;205:110740. doi: 10.1016/j.radonc.2025.110740. Epub 2025 Jan 22.
3
Large Language Model-Augmented Auto-Delineation of Treatment Target Volume in Radiation Therapy.
ArXiv. 2024 Jul 10:arXiv:2407.07296v1.