Semantic-Aware Modular Capsule Routing for Visual Question Answering.

Authors

Han Yudong, Yin Jianhua, Wu Jianlong, Wei Yinwei, Nie Liqiang

Publication

IEEE Trans Image Process. 2023;32:5537-5549. doi: 10.1109/TIP.2023.3318949. Epub 2023 Oct 5.

DOI: 10.1109/TIP.2023.3318949
PMID: 37773902
Abstract

Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can be answered simply by decomposing them into modular sub-problems. The recently proposed Neural Module Network (NMN) applies this strategy to question answering, yet it relies heavily on an off-the-shelf layout parser or additional expert policies for network architecture design instead of learning from the data. These strategies result in unsatisfactory adaptability to the semantically complicated variance of the inputs, thereby hindering the representational capacity and generalizability of the model. To tackle this problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed SUPER, to better capture instance-specific vision-semantic characteristics and refine discriminative representations for prediction. In particular, five powerful specialized modules as well as dynamic routers are tailored into each layer of the SUPER network, and compact routing spaces are constructed so that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme on five benchmark datasets, as well as its parameter-efficiency advantage. It is worth emphasizing that this work does not pursue state-of-the-art results in VQA. Instead, we expect our model to provide a novel perspective on architecture learning and representation calibration for VQA.
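The routing mechanism the abstract describes — per-layer specialized modules whose outputs are combined along instance-specific routes chosen by a dynamic router — can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the five modules are stand-in linear maps with a tanh nonlinearity, and the router is a single linear scoring layer followed by a softmax.

```python
import math
import random

def softmax(scores):
    """Normalize raw scores into route weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(mat, vec):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

class RoutedLayer:
    """One layer of a modular routing network (illustrative).

    The layer holds several specialized modules; a router produces
    instance-specific weights from the fused vision-semantic feature,
    and the layer output is the weighted combination of module outputs.
    """

    def __init__(self, dim, n_modules=5, seed=0):
        rnd = random.Random(seed)
        # Hypothetical modules: random linear maps standing in for the
        # specialized modules of the paper.
        self.modules = [[[rnd.gauss(0, dim ** -0.5) for _ in range(dim)]
                         for _ in range(dim)] for _ in range(n_modules)]
        # Router: one score per module, computed from the input feature.
        self.router = [[rnd.gauss(0, dim ** -0.5) for _ in range(dim)]
                       for _ in range(n_modules)]

    def forward(self, x):
        # Instance-specific route: soft weights over the modules.
        weights = softmax(matvec(self.router, x))
        # Each module transforms the feature independently.
        outs = [[math.tanh(v) for v in matvec(m, x)] for m in self.modules]
        # Output = convex combination of module outputs along the route.
        return [sum(w * o[i] for w, o in zip(weights, outs))
                for i in range(len(x))]
```

Because the router's weights depend on the input, two different question-image instances traverse different effective routes through the same set of modules, which is the basic idea behind learning the architecture from data rather than fixing it with a layout parser.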


Similar Articles

1. Semantic-Aware Modular Capsule Routing for Visual Question Answering.
IEEE Trans Image Process. 2023;32:5537-5549. doi: 10.1109/TIP.2023.3318949. Epub 2023 Oct 5.

2. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network.
IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):318-329. doi: 10.1109/TPAMI.2020.3004830. Epub 2021 Dec 7.

3. A Bi-level representation learning model for medical visual question answering.
J Biomed Inform. 2022 Oct;134:104183. doi: 10.1016/j.jbi.2022.104183. Epub 2022 Aug 28.

4. Medical visual question answering based on question-type reasoning and semantic space constraint.
Artif Intell Med. 2022 Sep;131:102346. doi: 10.1016/j.artmed.2022.102346. Epub 2022 Jun 30.

5. Transformer Module Networks for Systematic Generalization in Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10096-10105. doi: 10.1109/TPAMI.2024.3438887. Epub 2024 Nov 7.

6. Bridging the Cross-Modality Semantic Gap in Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):4519-4531. doi: 10.1109/TNNLS.2024.3370925. Epub 2025 Feb 28.

7. An effective spatial relational reasoning networks for visual question answering.
PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693. eCollection 2022.

8. Multitask Learning for Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2023 Mar;34(3):1380-1394. doi: 10.1109/TNNLS.2021.3105284. Epub 2023 Feb 28.

9. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5947-5959. doi: 10.1109/TNNLS.2018.2817340. Epub 2018 Apr 9.

10. Loss Re-Scaling VQA: Revisiting the Language Prior Problem From a Class-Imbalance View.
IEEE Trans Image Process. 2022;31:227-238. doi: 10.1109/TIP.2021.3128322. Epub 2021 Dec 7.