
Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding

Publication Information

IEEE Trans Image Process. 2022;31:4266-4277. doi: 10.1109/TIP.2022.3181516. Epub 2022 Jun 29.

DOI: 10.1109/TIP.2022.3181516
PMID: 35709109
Abstract

Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully utilize the information in the two modalities; it is more desirable to perform cross-modal interaction while the visual and linguistic features are being extracted. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate the mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. We customize the visual features with linguistic guidance at each stage of the PLVE through Channel-wise Language-guided Interaction Modules (CLIM). Without pre-training on object detection datasets, PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets while running at real-time speed. The source code is available in the supplementary material.
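The abstract does not spell out how a CLIM injects language into the visual encoder. One plausible reading of "channel-wise language-guided interaction" is a language-derived gate that rescales each visual channel at a given encoder stage; the PyTorch sketch below illustrates that reading only. The class name, dimensions, and sigmoid gating are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ChannelwiseLanguageGate(nn.Module):
    """Hypothetical CLIM-style block (assumed design, not the paper's code):
    project a pooled sentence embedding to one gate per visual channel,
    then rescale the visual feature map channel-wise (SE/FiLM-style)."""

    def __init__(self, lang_dim: int, visual_channels: int) -> None:
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, visual_channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, visual: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map from one encoder stage
        # lang:   (B, D) pooled sentence embedding
        weights = self.gate(lang)                  # (B, C)
        return visual * weights[:, :, None, None]  # broadcast over H and W


if __name__ == "__main__":
    lang = torch.randn(2, 768)                # e.g. a pooled BERT sentence vector
    stage_feat = torch.randn(2, 256, 64, 64)  # features from one encoder stage
    clim = ChannelwiseLanguageGate(lang_dim=768, visual_channels=256)
    print(clim(stage_feat, lang).shape)       # torch.Size([2, 256, 64, 64])
```

Under the paper's progressive scheme, one such module would sit at each PLVE stage, so the language signal shapes the visual features throughout the encoder rather than being fused once at the end.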

Similar Articles

1. Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding. IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1523-1533. doi: 10.1109/TNNLS.2022.3183827. Epub 2024 Feb 5.
2. Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding. IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1181-1198. doi: 10.1109/TPAMI.2023.3328185. Epub 2024 Jan 8.
3. Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts. IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3406-3421. doi: 10.1109/TPAMI.2023.3343736. Epub 2024 Apr 3.
4. Relationship-Embedded Representation Learning for Grounding Referring Expressions. IEEE Trans Pattern Anal Mach Intell. 2021 Aug;43(8):2765-2779. doi: 10.1109/TPAMI.2020.2973983. Epub 2021 Jul 1.
5. TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer. IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13636-13652. doi: 10.1109/TPAMI.2023.3296823. Epub 2023 Oct 3.
6. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation. Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
7. Human-Object Interaction Detection via Global Context and Pairwise-Level Fusion Features Integration. Neural Netw. 2024 Feb;170:242-253. doi: 10.1016/j.neunet.2023.11.002. Epub 2023 Nov 13.
8. Intention-Related Natural Language Grounding via Object Affordance Detection and Intention Semantic Extraction. Front Neurorobot. 2020 May 13;14:26. doi: 10.3389/fnbot.2020.00026. eCollection 2020.
9. Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations. IEEE Trans Image Process. 2023;32:5167-5180. doi: 10.1109/TIP.2023.3311917. Epub 2023 Sep 15.