Zhang Jing, Fang Zhongjun, Sun Han, Wang Zhe
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1785-1796. doi: 10.1109/TNNLS.2022.3185320. Epub 2024 Feb 5.
In research on image captioning, rich semantic information is important for generating key caption words and serves as guiding information. However, semantic information from offline object detectors involves many semantic objects that do not appear in the caption, thereby bringing noise into the decoding process. To produce more accurate semantic guiding information and further optimize the decoding process, we propose an end-to-end adaptive semantic-enhanced transformer (AS-Transformer) model for image captioning. For semantic enhancement information extraction, we propose a constrained weakly supervised learning (CWSL) module, which reconstructs the probability distribution of semantic objects detected by multiple instance learning (MIL) through a joint loss function. The strengthened semantic objects drawn from the reconstructed probability distribution better depict the semantic meaning of images. In addition, for semantic enhancement decoding, we propose an adaptive gated mechanism (AGM) module that adaptively adjusts the attention between visual and semantic information for more accurate generation of caption words. Through the joint control of the CWSL and AGM modules, the proposed model constructs a complete adaptive enhancement mechanism from encoding to decoding and obtains visual context that is better suited to captions. Experiments on the public Microsoft Common Objects in COntext (MSCOCO) and Flickr30K datasets show that the proposed AS-Transformer adaptively obtains effective semantic information and automatically adjusts the attention weights between semantic and visual information, producing more accurate captions than existing semantic enhancement methods and outperforming state-of-the-art methods.
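The abstract describes the AGM module as a gate that adaptively balances visual and semantic evidence when each caption word is generated. The following is a minimal, hedged PyTorch sketch of such a gating step; the class name, the gating form (a sigmoid gate predicted from the decoder state and both context vectors), and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an adaptive gate mixing visual and semantic contexts.
# Names, shapes, and the exact gating function are assumptions for illustration.
import torch
import torch.nn as nn


class AdaptiveGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Gate values are predicted from the decoder hidden state
        # concatenated with the visual and semantic context vectors.
        self.gate = nn.Linear(3 * d_model, d_model)

    def forward(self, hidden, visual_ctx, semantic_ctx):
        # hidden, visual_ctx, semantic_ctx: (batch, seq_len, d_model)
        g = torch.sigmoid(
            self.gate(torch.cat([hidden, visual_ctx, semantic_ctx], dim=-1))
        )
        # Element-wise interpolation between visual and semantic evidence.
        return g * visual_ctx + (1.0 - g) * semantic_ctx


if __name__ == "__main__":
    # Usage example with random tensors.
    gate = AdaptiveGate(d_model=512)
    h = torch.randn(2, 10, 512)
    v = torch.randn(2, 10, 512)
    s = torch.randn(2, 10, 512)
    print(gate(h, v, s).shape)  # torch.Size([2, 10, 512])
```

A learned element-wise gate of this kind lets the decoder lean on semantic cues for content words and on visual features elsewhere, which is the behavior the abstract attributes to the AGM module.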