


Deconfounded Image Captioning: A Causal Retrospect.

Authors

Yang Xu, Zhang Hanwang, Cai Jianfei

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12996-13010. doi: 10.1109/TPAMI.2021.3121705. Epub 2023 Oct 3.

DOI: 10.1109/TPAMI.2021.3121705
PMID: 34673483
Abstract

Dataset bias in vision-language tasks is becoming one of the main problems which hinders the progress of our community. Existing solutions lack a principled analysis about why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective: Deconfounded Image Captioning (DIC), to find out the answer of this question, then retrospect modern neural image captioners, and finally propose a DIC framework: DICv1.0 to alleviate the negative effects brought by dataset bias. DIC is based on causal inference, whose two principles: the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we showcase that DICv1.0 can strengthen two prevailing captioning models and can achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on Karpathy split and online split of the challenging MS COCO dataset, respectively. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
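The backdoor adjustment the abstract invokes can be made concrete with a small numerical sketch. Assuming a toy discrete confounder z (not the paper's actual model), the adjustment estimates the interventional distribution as P(y | do(x)) = Σ_z P(y | x, z) P(z), averaging over the confounder's marginal instead of conditioning on x alone:

```python
# Minimal sketch of the backdoor adjustment,
#   P(y | do(x)) = sum_z P(y | x, z) * P(z),
# with toy probability tables; the variable names and values
# are illustrative assumptions, not taken from the paper.

def backdoor_adjust(p_y_given_xz, p_z, x):
    """Average P(y | x, z) over the confounder's marginal P(z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

# Toy confounder z (scene context) influencing both x and y.
p_z = {"indoor": 0.7, "outdoor": 0.3}
p_y_given_xz = {
    ("dog", "indoor"): 0.2,
    ("dog", "outdoor"): 0.6,
}

result = backdoor_adjust(p_y_given_xz, p_z, "dog")
print(result)  # ≈ 0.32, i.e. 0.2 * 0.7 + 0.6 * 0.3
```

The front-door adjustment the abstract also mentions follows the same deconfounding idea but routes the estimate through an observed mediator when the confounder itself is unobserved.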


Similar Articles

1. Deconfounded Image Captioning: A Causal Retrospect.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12996-13010. doi: 10.1109/TPAMI.2021.3121705. Epub 2023 Oct 3.
2. Auto-Encoding and Distilling Scene Graphs for Image Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2313-2327. doi: 10.1109/TPAMI.2020.3042192. Epub 2022 Apr 1.
3. Visual Cluster Grounding for Image Captioning.
IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.
4. Attention-Guided Image Captioning through Word Information.
Sensors (Basel). 2021 Nov 30;21(23):7982. doi: 10.3390/s21237982.
5. Image Captioning via Dynamic Path Customization.
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6203-6217. doi: 10.1109/TNNLS.2024.3409354. Epub 2025 Apr 4.
6. Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network.
Sensors (Basel). 2022 Nov 1;22(21):8376. doi: 10.3390/s22218376.
7. Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning.
IEEE Trans Image Process. 2022;31:4321-4335. doi: 10.1109/TIP.2022.3183434. Epub 2022 Jun 30.
8. Social Image Captioning: Exploring Visual Attention and User Attention.
Sensors (Basel). 2018 Feb 22;18(2):646. doi: 10.3390/s18020646.
9. Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning.
IEEE Trans Cybern. 2024 Feb;54(2):890-902. doi: 10.1109/TCYB.2022.3156367. Epub 2024 Jan 17.
10. Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.
IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.

Cited By

1. Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA.
IEEE Trans Multimedia. 2024;26:8609-8624. doi: 10.1109/tmm.2024.3380259. Epub 2024 Mar 21.
2. Backdoor Adjustment of Confounding by Provenance for Robust Text Classification of Multi-institutional Clinical Notes.
AMIA Annu Symp Proc. 2024 Jan 11;2023:923-932. eCollection 2023.