School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, China.
Comput Intell Neurosci. 2022 Apr 28;2022:1204909. doi: 10.1155/2022/1204909. eCollection 2022.
To address the problems that existing video captioning models attend to incomplete information and generate insufficiently accurate descriptions, a video captioning model that integrates image, audio, and motion optical flow is proposed. Models pretrained on a variety of large-scale datasets are used to extract video frame features, motion information, audio features, and video sequence features. An embedding layer based on the self-attention mechanism is designed to embed each single-modality feature and learn its parameters. Two schemes, joint representation and cooperative representation, are then used to fuse the multimodal feature vectors output by the embedding layer, so that the model can attend to different targets in a video and their interactions, which effectively improves the performance of the video captioning model. Experiments are carried out on the large-scale MSR-VTT and LSMDC datasets. On the MSR-VTT benchmark, the model achieves scores of 0.443, 0.327, 0.619, and 0.521 under the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics, respectively. The results show that the proposed method effectively improves video captioning performance, and the evaluation metrics improve over the comparison models.
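As a rough illustration of the embedding and fusion steps described above, the following is a minimal PyTorch sketch. The module names, feature dimensions, and the concrete fusion operators (concatenation for the joint representation, cross-attention for the cooperative representation) are assumptions made for illustration only; the abstract does not specify the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Embeds one modality's features and refines them with self-attention.
    (Hypothetical realization of the paper's self-attention embedding layer.)"""
    def __init__(self, in_dim, d_model=512, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq, in_dim)
        h = self.proj(x)                  # project into a shared model space
        a, _ = self.attn(h, h, h)         # self-attention over time steps
        return self.norm(h + a)           # residual connection + layer norm

def joint_fusion(feats):
    """Joint representation (assumed here to be channel-wise concatenation)."""
    return torch.cat(feats, dim=-1)       # (batch, seq, d_model * n_modalities)

class CooperativeFusion(nn.Module):
    """Cooperative representation (assumed here to be cross-attention):
    one modality queries the others, modeling cross-modal interactions."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_feat, other_feats):
        ctx = torch.cat(other_feats, dim=1)      # stack the remaining modalities along time
        a, _ = self.cross(query_feat, ctx, ctx)  # query attends to cross-modal context
        return self.norm(query_feat + a)

if __name__ == "__main__":
    B, T = 2, 16
    # Hypothetical per-modality feature dimensions.
    frame = torch.randn(B, T, 2048)   # e.g., pretrained CNN frame features
    flow  = torch.randn(B, T, 1024)   # e.g., motion / optical-flow features
    audio = torch.randn(B, T, 128)    # e.g., pretrained audio features

    embeds = [ModalityEmbedding(d)(x)
              for d, x in [(2048, frame), (1024, flow), (128, audio)]]

    fused_joint = joint_fusion(embeds)            # (B, T, 1536)
    fused_coop = CooperativeFusion()(embeds[0], embeds[1:])  # (B, T, 512)
    print(fused_joint.shape, fused_coop.shape)
```

In this sketch each modality is first mapped to a common dimension so the two fusion schemes can be compared directly; a caption decoder consuming the fused features is omitted.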