Yang Yang, Zhou Jie, Ai Jiangbo, Bin Yi, Hanjalic Alan, Shen Heng Tao, Ji Yanli
IEEE Trans Image Process. 2018 Jul 12. doi: 10.1109/TIP.2018.2855422.
In this paper, we propose a novel approach to video captioning based on adversarial learning and Long Short-Term Memory (LSTM). With this solution concept, we aim to compensate for the deficiencies of LSTM-based video captioning methods, which generally show potential to effectively handle the temporal nature of video data when generating captions, but which also typically suffer from exponential error accumulation. Specifically, we adopt a standard Generative Adversarial Network (GAN) architecture, characterized by an interplay of two competing processes: a "generator", which generates textual sentences given the visual content of a video, and a "discriminator", which controls the accuracy of the generated sentences. The discriminator acts as an "adversary" towards the generator and, through its controlling mechanism, helps the generator to become more accurate. For the generator module, we take an existing video captioning concept using an LSTM network. For the discriminator, we propose a novel realization specifically tuned for the video captioning problem, taking both the sentences and the video features as input. This leads to our proposed LSTM-GAN system architecture, which we show experimentally to significantly outperform existing methods on standard public datasets.
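The abstract does not include code; the following is a minimal PyTorch sketch of how such a generator/discriminator pairing could be structured, with the generator producing word logits from video features via an LSTM and the discriminator scoring (video, sentence) pairs. All class names, parameter names, and dimensions (CaptionGenerator, CaptionDiscriminator, feat_dim, hidden_dim, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Illustrative LSTM generator: encodes video features, then decodes a word sequence."""
    def __init__(self, feat_dim, vocab_size, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)    # project per-frame features
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # per-step word logits

    def forward(self, video_feats, captions):
        # video_feats: (B, T_v, feat_dim); captions: (B, T_w) word indices
        vis = self.feat_proj(video_feats)                  # (B, T_v, embed_dim)
        txt = self.embed(captions)                         # (B, T_w, embed_dim)
        seq = torch.cat([vis, txt], dim=1)                 # feed video steps, then word steps
        h, _ = self.lstm(seq)
        return self.out(h[:, vis.size(1):])                # logits for the word positions only

class CaptionDiscriminator(nn.Module):
    """Illustrative discriminator: scores whether a (video, sentence) pair looks real or generated."""
    def __init__(self, feat_dim, vocab_size, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vid_proj = nn.Linear(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim * 2, 1)          # real/fake logit

    def forward(self, video_feats, captions):
        _, (h_sent, _) = self.sent_lstm(self.embed(captions))
        h_vid = self.vid_proj(video_feats.mean(dim=1))     # mean-pool frame features
        joint = torch.cat([h_sent[-1], h_vid], dim=-1)     # fuse sentence and video representations
        return self.score(joint)                           # higher score = judged more "real"
```

In this sketch the discriminator conditions on both the sentence and the pooled video features, mirroring the paper's description that the discriminator takes both as input; the actual conditioning and training scheme (e.g. how gradients reach the discrete word choices) would follow the authors' own design.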