Xu Yang, Hanwang Zhang, Jianfei Cai
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12996-13010. doi: 10.1109/TPAMI.2021.3121705. Epub 2023 Oct 3.
Dataset bias in vision-language tasks is becoming one of the main obstacles hindering the progress of our community. Existing solutions lack a principled analysis of why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to answer this question; we then retrospect modern neural image captioners and finally propose a DIC framework, DICv1.0, to alleviate the negative effects of dataset bias. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous studies and design new, effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models, achieving a single-model 131.1 CIDEr-D on the Karpathy split and 128.4 c40 CIDEr-D on the online split of the challenging MS COCO dataset. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
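For readers unfamiliar with the two causal-inference principles named above, a brief sketch of their standard textbook formulations (these are the general adjustment formulas from causal inference, not equations reproduced from this paper): with treatment $X$, outcome $Y$, observed confounder $z$, and mediator $m$,

```latex
% Backdoor adjustment: deconfound X -> Y by stratifying over the
% observed confounder z
P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, z)\, P(z)

% Front-door adjustment: when the confounder is unobserved, route the
% causal effect through a mediator m between X and Y
P(Y \mid \mathrm{do}(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid x', m)\, P(x')
```

Intuitively, the backdoor formula averages the conditional prediction over the confounder's prior rather than its biased co-occurrence with $X$, while the front-door formula applies when the confounder cannot be observed but a suitable mediator can be.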