Shanghai Key Lab of Modern Optical System, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, 200093, Shanghai, China.
Neural Netw. 2021 Jun;138:57-67. doi: 10.1016/j.neunet.2021.01.023. Epub 2021 Feb 10.
Synthesizing photo-realistic images from text descriptions is a challenging task in computer vision. Although generative adversarial networks have made significant breakthroughs in this task, they still face substantial challenges in generating high-quality, visually realistic images that are consistent with the semantics of the text. Existing text-to-image methods generally accomplish this task in two steps: first generating an initial image with a rough outline and colors, and then progressively refining it into a high-resolution image. One drawback of these methods is that if the quality of the initial image is low, it is hard to generate a satisfactory high-resolution image from it. In this paper, we propose SAM-GAN, a Self-Attention supporting Multi-stage Generative Adversarial Network, for text-to-image synthesis. With the self-attention mechanism, the model can establish multi-level dependencies within the image and fuse sentence- and word-level visual-semantic vectors to improve the quality of the generated image. Furthermore, a multi-stage perceptual loss is introduced to enhance the semantic similarity between the synthesized image and the real image, thereby strengthening the visual-semantic consistency between text and images. To promote diversity in the generated images, a mode seeking regularization term is integrated into the model. The results of extensive experiments and ablation studies, conducted on the Caltech-UCSD Birds and Microsoft Common Objects in Context datasets, show that our model outperforms competitive models in text-to-image synthesis.
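To make the fusion step concrete, the sketch below shows one plausible way attention can fuse word-level text features with image region features, in the spirit of the self-attention fusion the abstract describes. It is a minimal, hypothetical illustration: the module name WordLevelAttentionFusion, the tensor shapes, and the residual fusion rule are our assumptions, not the authors' SAM-GAN implementation.

```python
# Hypothetical sketch (PyTorch) of fusing word-level text embeddings with
# spatial image features via attention. Shapes and the fusion rule are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttentionFusion(nn.Module):
    def __init__(self, word_dim: int, img_dim: int):
        super().__init__()
        # Project word embeddings into the image feature space.
        self.proj = nn.Conv1d(word_dim, img_dim, kernel_size=1)

    def forward(self, img_feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) grid of image region features
        # words:    (B, D, T) sequence of word embeddings
        b, c, h, w = img_feat.shape
        regions = img_feat.view(b, c, h * w)                # (B, C, HW)
        keys = self.proj(words)                             # (B, C, T)
        # Each image region attends over all words.
        attn = torch.bmm(regions.transpose(1, 2), keys)     # (B, HW, T)
        attn = F.softmax(attn, dim=-1)
        # Per-region word context vector, fused back residually.
        context = torch.bmm(keys, attn.transpose(1, 2))     # (B, C, HW)
        fused = regions + context
        return fused.view(b, c, h, w)
```

A sentence-level vector could be fused analogously, e.g. broadcast over all regions and concatenated before the next generation stage; the paper's exact combination of the two levels may differ.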
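The two auxiliary objectives can likewise be sketched briefly. Below is a hedged illustration of a perceptual loss applied at every generation stage and of a mode seeking regularization term (in the style of Mao et al.'s Mode Seeking GANs, which the abstract's term echoes). The choice of VGG-16 features, the layer cut-off, the L1 distances, and the epsilon are illustrative assumptions, not the paper's reported settings.

```python
# Hedged sketch of the two auxiliary losses named in the abstract. The VGG
# layer choice and constants are assumptions for illustration only; input
# normalization for VGG is omitted for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG-16 feature extractor (truncated early, as an assumption).
_vgg = vgg16(pretrained=True).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def multi_stage_perceptual_loss(fake_stages, real_stages):
    """L1 distance between VGG features of generated and real images,
    summed over all resolution stages of the multi-stage generator."""
    loss = 0.0
    for fake, real in zip(fake_stages, real_stages):
        loss = loss + F.l1_loss(_vgg(fake), _vgg(real))
    return loss

def mode_seeking_reg(img1, img2, z1, z2, eps=1e-5):
    """Encourage output diversity: maximize the ratio of image distance to
    latent distance, implemented here as a reciprocal term to minimize."""
    d_img = F.l1_loss(img1, img2, reduction='mean')
    d_z = F.l1_loss(z1, z2, reduction='mean')
    return 1.0 / (d_img / (d_z + eps) + eps)
```

In training, the generator loss would add both terms with weighting coefficients to the adversarial loss; the weights are hyperparameters we do not attempt to reproduce here.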