Song Peipei, Guo Dan, Zhou Jinxing, Xu Mingliang, Wang Meng
IEEE Trans Cybern. 2023 Jul;53(7):4388-4399. doi: 10.1109/TCYB.2022.3175012. Epub 2023 Jun 15.
Most image captioning methods are trained under the full supervision of paired image-caption data. Owing to the high cost of data collection, the task of unpaired image captioning has attracted researchers' attention. In this article, we propose a novel memorial GAN (MemGAN) with joint semantic optimization for unpaired image captioning. The core idea is to explore the implicit semantic correlation between disjoint images and sentences by building a multimodal semantic-aware space (SAS). Concretely, each modality is mapped into a unified multimodal SAS, which includes the semantic vectors of the image I, the visual concepts O, the unpaired sentence S, and the generated caption C. We adopt a memory unit based on multihead attention and a relational gate as the backbone to preserve and transmit crucial multimodal semantics in the SAS for image caption generation and sentence reconstruction. The memory unit is then embedded into a GAN framework to exploit the semantic similarity and relevance in the SAS, that is, to impose a joint semantic-aware optimization on the SAS without supervision cues. In summary, the proposed MemGAN learns the latent semantic relevance among the SAS's modalities in an adversarial manner. Extensive experiments and qualitative results demonstrate the effectiveness of MemGAN, which achieves improvements over the state of the art on unpaired image captioning benchmarks.
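The abstract describes a memory unit that reads multimodal semantic vectors with multihead attention and updates its stored content through a relational gate. The following PyTorch sketch illustrates one plausible form of such a unit; the module name, dimensions, slot layout, and the sigmoid gating formula are illustrative assumptions, not the paper's actual MemGAN implementation.

```python
import torch
import torch.nn as nn

class MemoryUnit(nn.Module):
    """Sketch of a gated attention memory (assumed design, for illustration)."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Multihead attention reads new semantic vectors into the memory slots.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Relational gate": decides how much attended content replaces old memory.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, memory: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
        # memory: (batch, num_slots, dim) -- stored multimodal semantics
        # inputs: (batch, seq_len, dim)   -- incoming semantic vectors (e.g., I/O/S/C)
        attended, _ = self.attn(query=memory, key=inputs, value=inputs)
        g = torch.sigmoid(self.gate(torch.cat([memory, attended], dim=-1)))
        return g * attended + (1.0 - g) * memory  # gated memory update

# Usage: update a 4-slot memory (hypothetically, one slot per SAS modality).
unit = MemoryUnit()
mem = torch.randn(2, 4, 512)      # slots for image I, concepts O, sentence S, caption C
feats = torch.randn(2, 10, 512)   # new semantic vectors from an encoder
mem = unit(mem, feats)
print(mem.shape)                  # torch.Size([2, 4, 512])
```

In the paper's framework, such a unit would be trained adversarially inside a GAN, with the generator producing captions from the memory and the discriminator scoring semantic relevance in the SAS; those components are omitted here.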