Xiong Haitao, Zhou Yuchen, Liu Jiaming, Cai Yuanyuan
School of International Economics and Management, Beijing Technology and Business University, Beijing, China.
National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing, China.
Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.
The video-based commonsense captioning task aims to supplement video captions with multiple commonsense descriptions so that video content can be understood better. This paper focuses on the importance of cross-modal mapping. We propose a combined framework, called Class-dependent and Cross-modal Memory Network considering SENtimental features (CCMN-SEN), for video-based captioning to enhance commonsense caption generation. First, we develop a class-dependent memory that records the alignment between video features and text; it permits cross-modal interaction and generation only on cross-modal matrices that share the same labels. Second, to capture the sentiments conveyed in a video and generate accurate captions, we add sentiment features that facilitate commonsense caption generation. Experimental results demonstrate that the proposed CCMN-SEN significantly outperforms state-of-the-art methods. These results have practical significance for better understanding video content.
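The class-dependent restriction described above (cross-modal interaction allowed only between video and text elements that share a label) can be illustrated with a minimal label-masked cross-modal attention sketch. This is an assumption-laden illustration of the general idea, not the paper's actual architecture; all names, shapes, and the NumPy formulation are hypothetical.

```python
import numpy as np

def class_masked_cross_attention(video_feats, text_feats,
                                 video_labels, text_labels):
    """Cross-modal attention where each text query may attend only to
    video features sharing its class label.

    Illustrative sketch: video_feats (V, d), text_feats (T, d),
    video_labels (V,), text_labels (T,) are all hypothetical inputs.
    Assumes every text label appears among the video labels.
    """
    d = video_feats.shape[1]
    # Similarity matrix between text queries and video keys: (T, V).
    scores = text_feats @ video_feats.T / np.sqrt(d)
    # Mask out pairs whose class labels differ (a large negative value
    # drives their softmax weight to ~0).
    same_label = text_labels[:, None] == video_labels[None, :]
    scores = np.where(same_label, scores, -1e9)
    # Softmax over video positions (numerically stabilized).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Aggregate video features for each text query.
    return weights @ video_feats

# Toy usage: the query with label 0 attends only to the label-0 video feature.
video_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
text_feats = np.array([[1.0, 0.0]])
attended = class_masked_cross_attention(video_feats, text_feats,
                                        np.array([0, 1]), np.array([0]))
```

The label mask is what makes the memory "class-dependent": attention weight mass only flows between same-class entries, so alignment is recorded and retrieved per class rather than across the whole cross-modal matrix.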