Visual Cluster Grounding for Image Captioning.

Author Information

Jiang Wenhui, Zhu Minwei, Fang Yuming, Shi Guangming, Zhao Xiaowei, Liu Yang

Publication Information

IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.

Abstract

Attention mechanisms have been extensively adopted in vision-and-language tasks such as image captioning. Attention encourages a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, due to noisy grounding annotations and large variations in object appearance, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve both grounding and captioning performance, we propose a novel grounding model that implicitly links words to the supporting evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of objects, which may be either discriminative parts or the full object content. With these relaxed constraints, the proposed captioning model captures correct linguistic characteristics and visual relevance, and thus generates more grounded image captions. In addition, we propose a novel quantitative metric that evaluates the correctness of the soft attention mechanism by considering the overall contribution of all object proposals when a given word is generated. The proposed grounding model can be seamlessly plugged into most attention-based architectures without adding inference complexity. Extensive experiments on the Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) datasets demonstrate that the proposed method consistently improves both grounding and captioning. Moreover, the proposed attention evaluation metric is more consistent with captioning performance.
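The abstract describes the proposed attention-correctness metric only at a high level, so the sketch below is one plausible reading rather than the authors' implementation: it scores the soft attention at a decoding step by weighting every proposal's attention mass by its IoU with the word's ground-truth box, so that all proposals contribute to the score instead of only the top-attended one. All function and variable names here are illustrative assumptions.

```python
# A minimal sketch (assumed, not the authors' code) of an attention-correctness
# metric that aggregates the contribution of ALL object proposals.
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def attention_grounding_score(attn_weights, proposal_boxes, gt_box):
    """Score the soft attention emitted for one grounded word.

    attn_weights:   (N,) softmax attention over N object proposals at the
                    decoding step that generates the word.
    proposal_boxes: (N, 4) proposal boxes as [x1, y1, x2, y2].
    gt_box:         (4,) ground-truth box annotated for the word.

    Every proposal contributes in proportion to both its attention weight
    and its overlap with the ground-truth region.
    """
    overlaps = np.array([iou(b, gt_box) for b in proposal_boxes])
    return float(np.dot(attn_weights, overlaps))

# Toy usage: attention split over two proposals, one tightly covering the object.
boxes = np.array([[10, 10, 60, 60], [40, 40, 100, 100]], dtype=float)
attn = np.array([0.7, 0.3])
gt = np.array([12, 12, 58, 58], dtype=float)
print(attention_grounding_score(attn, boxes, gt))  # higher = better grounded
```

Under this reading, summing attention-weighted overlaps rather than checking only the argmax proposal lets the metric credit attention that is spread across several proposals covering the same object, which is what "the overall contribution of all object proposals" suggests.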

