
Toward Accurate Visual Reasoning With Dual-Path Neural Module Networks

Authors

Su Ke, Su Hang, Li Jianguo, Zhu Jun

Affiliations

THBI Lab, Department of Computer Science and Technology, BNRist Center, Institute for AI, Tsinghua University, Beijing, China.

Intel Labs China, Beijing, China.

Publication

Front Robot AI. 2020 Aug 21;7:109. doi: 10.3389/frobt.2020.00109. eCollection 2020.

DOI: 10.3389/frobt.2020.00109
PMID: 33501276
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7805672/
Abstract

Visual reasoning is a critical stage in visual question answering (Antol et al., 2015), but most state-of-the-art methods treat the VQA task as a classification problem without taking the reasoning process into account. Various approaches have been proposed to solve this multi-modal task, which requires both comprehension and reasoning abilities. The recently proposed neural module network (Andreas et al., 2016b), which assembles the model from a few primitive modules, can perform spatial or arithmetical reasoning over the input image to answer the question. Nevertheless, its performance is unsatisfactory, especially on real-world datasets (e.g., VQA 1.0 & 2.0), owing to its limited primitive modules and suboptimal layouts. To address these issues, we propose a novel Dual-Path Neural Module Network, which performs complex visual reasoning by forming a more flexible layout regularized by a pairwise loss. Specifically, we first use a region proposal network to generate both visual and spatial information, which helps the model perform spatial reasoning. Then, we advocate processing a pair of different images together with the same question, called a "complementary pair," which encourages the model to learn a more reasonable layout by suppressing overfitting to language priors. The model jointly learns the parameters of the primitive modules and the layout-generation policy, which is further boosted by a novel pairwise reward. Extensive experiments show that our approach significantly improves the performance of neural module networks, especially on real-world datasets.
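The "complementary pair" idea in the abstract can be illustrated with a minimal sketch (hypothetical function and parameter names; the paper itself uses this kind of signal as a pairwise reward to train the layout-generation policy via reinforcement learning, which is not reproduced here). Two different images share the same question: a model that answers from language priors alone will ignore the images and predict the same answer for both, so identical predictions on the pair are penalized while correct answers are rewarded.

```python
def pairwise_reward(pred_a, pred_b, gt_a, gt_b, penalty=0.5):
    """Toy pairwise reward for a complementary pair: two images, same
    question, different ground-truth answers.

    Each correct prediction earns +1. Predicting the same answer for
    both images -- a symptom of relying on language priors instead of
    the visual input -- incurs a penalty.
    """
    reward = float(pred_a == gt_a) + float(pred_b == gt_b)
    if pred_a == pred_b:  # the model likely ignored the images
        reward -= penalty
    return reward


# Example: question "What color is the car?" over two images.
print(pairwise_reward("red", "blue", "red", "blue"))  # both correct: 2.0
print(pairwise_reward("red", "red", "red", "blue"))   # prior-driven: 0.5
```

In the paper this signal regularizes layout search jointly with the answer loss; the sketch above only captures the pairwise component.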


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f05/7805672/648510e281c4/frobt-07-00109-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f05/7805672/7c3225b78f61/frobt-07-00109-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f05/7805672/4bf2bc628a8d/frobt-07-00109-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f05/7805672/f6d7d93e04b3/frobt-07-00109-g0004.jpg

Similar articles

1. Toward Accurate Visual Reasoning With Dual-Path Neural Module Networks.
Front Robot AI. 2020 Aug 21;7:109. doi: 10.3389/frobt.2020.00109. eCollection 2020.
2. An effective spatial relational reasoning networks for visual question answering.
PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693. eCollection 2022.
3. Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering.
Sensors (Basel). 2022 Feb 17;22(4):1575. doi: 10.3390/s22041575.
4. Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.
5. Interpretable Visual Question Answering by Reasoning on Dependency Trees.
IEEE Trans Pattern Anal Mach Intell. 2021 Mar;43(3):887-901. doi: 10.1109/TPAMI.2019.2943456. Epub 2021 Feb 4.
6. Robust visual question answering via polarity enhancement and contrast.
Neural Netw. 2024 Nov;179:106560. doi: 10.1016/j.neunet.2024.106560. Epub 2024 Jul 20.
7. DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering.
IEEE Trans Image Process. 2023;32:4812-4827. doi: 10.1109/TIP.2023.3306910. Epub 2023 Aug 29.
8. Medical Visual Question Answering via Conditional Reasoning and Contrastive Learning.
IEEE Trans Med Imaging. 2023 May;42(5):1532-1545. doi: 10.1109/TMI.2022.3232411. Epub 2023 May 2.
9. Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding.
IEEE Trans Neural Netw Learn Syst. 2022 Jul;33(7):2758-2767. doi: 10.1109/TNNLS.2020.3045034. Epub 2022 Jul 6.
10. Structured Multimodal Attentions for TextVQA.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9603-9614. doi: 10.1109/TPAMI.2021.3132034. Epub 2022 Nov 7.

References cited in this article

1. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149. doi: 10.1109/TPAMI.2016.2577031. Epub 2016 Jun 6.