Liu Sheng, Ren Zhou, Yuan Junsong
IEEE Trans Pattern Anal Mach Intell. 2021 Sep;43(9):3259-3272. doi: 10.1109/TPAMI.2019.2940007. Epub 2021 Aug 4.
Visual captioning, the task of describing an image or a video in one or a few sentences, is challenging because it requires understanding copious visual information and describing it in natural language. Motivated by the success of neural networks in machine translation, previous work applies sequence-to-sequence learning to translate videos into sentences. In this work, unlike previous approaches that encode visual information with a single stream, we introduce a novel Sibling Convolutional Encoder (SibNet) for visual captioning, which employs a dual-branch architecture to collaboratively encode videos. The first, content branch encodes the visual content of the video with an autoencoder, capturing its visual appearance as other networks typically do. The second, semantic branch encodes the semantic information of the video via visual-semantic joint embedding, yielding a complementary representation that accounts for semantics when extracting features. The two branches are then combined with a soft-attention mechanism and fed into an RNN decoder to generate captions. By explicitly capturing both content and semantic information, SibNet better represents the rich information in videos. To validate the advantages of the proposed model, we conduct experiments on two video captioning benchmarks, YouTube2Text and MSR-VTT. The results demonstrate that SibNet consistently outperforms existing methods across different evaluation metrics.
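The abstract describes fusing the two branch outputs with a soft-attention mechanism before decoding. A minimal NumPy sketch of that fusion step is shown below; it is an illustration under assumptions, not the paper's implementation, and the function and parameter names (`attend_and_fuse`, the projection matrix `W`, the decoder state `h`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_fuse(content_feats, semantic_feats, h, W):
    """Hypothetical soft-attention fusion of the two SibNet branches.

    content_feats, semantic_feats: per-frame features from the content
    and semantic branches, each of shape (T, D).
    h: current RNN decoder state, shape (H,).
    W: learned projection, shape (2*D, H) (an illustrative scoring form).
    """
    # Concatenate branch features per frame, so attention sees both
    # content and semantic information jointly.
    feats = np.concatenate([content_feats, semantic_feats], axis=1)  # (T, 2D)
    scores = feats @ W @ h           # one alignment score per frame, (T,)
    weights = softmax(scores)        # soft-attention weights, sum to 1
    context = weights @ feats        # weighted context vector, (2D,)
    return context, weights

# Toy usage with random features.
rng = np.random.default_rng(0)
T, D, H = 5, 8, 4                    # frames, branch dim, decoder dim
content = rng.standard_normal((T, D))
semantic = rng.standard_normal((T, D))
h = rng.standard_normal(H)
W = rng.standard_normal((2 * D, H))
context, weights = attend_and_fuse(content, semantic, h, W)
```

The resulting `context` vector would then be fed to the RNN decoder at each time step; recomputing the weights from the current decoder state lets the decoder attend to different frames while generating each word.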