Visual Cluster Grounding for Image Captioning.

Author Information

Jiang Wenhui, Zhu Minwei, Fang Yuming, Shi Guangming, Zhao Xiaowei, Liu Yang

Publication Information

IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.

Abstract

Attention mechanisms have been extensively adopted in vision-and-language tasks such as image captioning. Attention encourages a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, due to noisy grounding annotations and large variations in object appearance, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve both grounding and captioning performance, we propose a novel grounding model that implicitly links words to the supporting evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of objects, which may be either discriminative parts or the full object content. With these relaxed constraints, the proposed captioning model captures correct linguistic characteristics and visual relevance, and thus generates more grounded image captions. In addition, we propose a novel quantitative metric that evaluates the correctness of the soft attention mechanism by considering the overall contribution of all object proposals when a given word is generated. The proposed grounding model can be seamlessly plugged into most attention-based architectures without adding inference complexity. Extensive experiments on the Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) datasets demonstrate that the proposed method consistently improves both grounding and captioning. Moreover, the proposed attention evaluation metric is more consistent with captioning performance.
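The abstract describes the proposed attention-correctness metric only at a high level, so the sketch below is one plausible reading rather than the authors' implementation: it scores the soft attention at a decoding step by weighting every proposal's attention mass by its IoU with the word's ground-truth box, so that all proposals contribute to the score instead of only the top-attended one. All function and variable names here are illustrative assumptions.

```python
# A minimal sketch (assumed, not the authors' code) of an attention-correctness
# metric that aggregates the contribution of ALL object proposals.
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def attention_grounding_score(attn_weights, proposal_boxes, gt_box):
    """Score the soft attention emitted for one grounded word.

    attn_weights:   (N,) softmax attention over N object proposals at the
                    decoding step that generates the word.
    proposal_boxes: (N, 4) proposal boxes as [x1, y1, x2, y2].
    gt_box:         (4,) ground-truth box annotated for the word.

    Every proposal contributes in proportion to both its attention weight
    and its overlap with the ground-truth region.
    """
    overlaps = np.array([iou(b, gt_box) for b in proposal_boxes])
    return float(np.dot(attn_weights, overlaps))

# Toy usage: attention split over two proposals, one tightly covering the object.
boxes = np.array([[10, 10, 60, 60], [40, 40, 100, 100]], dtype=float)
attn = np.array([0.7, 0.3])
gt = np.array([12, 12, 58, 58], dtype=float)
print(attention_grounding_score(attn, boxes, gt))  # higher = better grounded
```

Under this reading, summing attention-weighted overlaps rather than checking only the argmax proposal lets the metric credit attention that is spread across several proposals covering the same object, which is what "the overall contribution of all object proposals" suggests.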

