Visual question answering based on local-scene-aware referring expression generation.

Affiliations

Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Republic of Korea.

Department of Artificial Intelligence, Kyungpook National University, Daehak-ro, Buk-gu, Daegu 41566, Republic of Korea.

Publication Information

Neural Netw. 2021 Jul;139:158-167. doi: 10.1016/j.neunet.2021.02.001. Epub 2021 Feb 24.

DOI: 10.1016/j.neunet.2021.02.001
PMID: 33714005
Abstract

Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories combined with their relationships, or of simple question embeddings, is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose using text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be combined with visual features and question embeddings to obtain the question-relevant answer. A joint-embedding multi-head attention network is also proposed to model the three different information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and show that it outperformed all competing methods in both quantitative and qualitative terms.
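The central idea of the abstract, attending over three modalities (region-level visual features, question embeddings, and embeddings of the generated referring expressions) with multi-head co-attention, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of that idea: the module name, layer sizes, mean-pooling fusion, and answer-vocabulary size are illustrative assumptions, not the authors' published architecture.

```python
# A minimal sketch of three-modality co-attention for VQA, assuming the
# question acts as the query against the other two modalities. All names
# and dimensions here are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class TriModalCoAttention(nn.Module):
    """Joint-embedding multi-head attention over visual features,
    question embeddings, and generated referring-expression embeddings."""
    def __init__(self, dim: int = 512, heads: int = 8, num_answers: int = 3129):
        super().__init__()
        # One cross-attention block per modality the question attends to.
        self.q_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_expr = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fuse the two attended summaries and predict answer logits.
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, vis, expr, question):
        # vis:      (B, N_regions, dim) region-level visual features
        # expr:     (B, N_tokens, dim)  embedded generated expressions
        # question: (B, L, dim)         question token embeddings
        attended_vis, _ = self.q_to_vis(question, vis, vis)      # question attends to image
        attended_expr, _ = self.q_to_expr(question, expr, expr)  # question attends to text
        # Pool over question positions and concatenate the two summaries.
        fused = torch.cat(
            [attended_vis.mean(dim=1), attended_expr.mean(dim=1)], dim=-1
        )
        return self.classifier(fused)  # (B, num_answers) answer logits
```

Using the question as the query against each other modality is one common way to realize co-attention; the paper's joint-embedding network may order or fuse the modalities differently.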

Similar Articles

1. Visual question answering based on local-scene-aware referring expression generation. Neural Netw. 2021 Jul;139:158-167. doi: 10.1016/j.neunet.2021.02.001. Epub 2021 Feb 24.
2. Robust visual question answering via polarity enhancement and contrast. Neural Netw. 2024 Nov;179:106560. doi: 10.1016/j.neunet.2024.106560. Epub 2024 Jul 20.
3. Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering. Sensors (Basel). 2020 Nov 26;20(23):6758. doi: 10.3390/s20236758.
4. Advancing surgical VQA with scene graph knowledge. Int J Comput Assist Radiol Surg. 2024 Jul;19(7):1409-1417. doi: 10.1007/s11548-024-03141-y. Epub 2024 May 23.
5. MAGE: Multi-scale Context-aware Interaction based on Multi-granularity Embedding for Chinese Medical Question Answer Matching. Comput Methods Programs Biomed. 2023 Jan;228:107249. doi: 10.1016/j.cmpb.2022.107249. Epub 2022 Nov 17.
6. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network. IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):318-329. doi: 10.1109/TPAMI.2020.3004830. Epub 2021 Dec 7.
7. Vision-Language-Knowledge Co-Embedding for Visual Commonsense Reasoning. Sensors (Basel). 2021 Apr 21;21(9):2911. doi: 10.3390/s21092911.
8. An Effective Dense Co-Attention Networks for Visual Question Answering. Sensors (Basel). 2020 Aug 30;20(17):4897. doi: 10.3390/s20174897.
9. Exploring Duality in Visual Question-Driven Top-Down Saliency. IEEE Trans Neural Netw Learn Syst. 2020 Jul;31(7):2672-2679. doi: 10.1109/TNNLS.2019.2933439. Epub 2019 Sep 2.
10. Parallel multi-head attention and term-weighted question embedding for medical visual question answering. Multimed Tools Appl. 2023 Mar 11:1-22. doi: 10.1007/s11042-023-14981-2.