Zhao Hong, Chen Zhiwen, Guo Lan, Han Zeyu
School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, China.
Network & Information Center, Lanzhou University of Technology, Lanzhou, Gansu, China.
PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.
Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoder block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.
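The pipeline in the abstract passes per-frame CNN features through a Transformer encoder block before LSTM decoding. As a rough illustration of what such an encoder block computes, here is a minimal single-head sketch in NumPy (toy dimensions, random weights, pre-norm residuals); the paper's actual block sizes, head count, and normalization placement are not specified in the abstract, so all of those choices here are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One Transformer encoder block over a sequence of frame features.

    x: (n_frames, d) matrix of CNN features (e.g. from ResNet-152).
    Single-head self-attention + feed-forward, each with a residual
    connection; a real ViT block uses multi-head attention.
    """
    # Self-attention sub-layer (pre-norm residual).
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product
    x = x + softmax(scores) @ v @ Wo
    # Position-wise feed-forward sub-layer (ReLU MLP, residual).
    h = layer_norm(x)
    x = x + np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
    return x

# Toy usage: 4 frame features of dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)) * 0.1, np.zeros(d)
out = encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2)
print(out.shape)  # same shape as the input: (4, 8)
```

The encoded sequence `out` would then serve as input to the LSTM decoder; the reinforcement-learning stage typically optimizes a sequence-level reward (e.g. CIDEr) over sampled captions, which this sketch does not cover.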