Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic University, Macau 999078, China.
Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
Dense video captioning is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of a video and ignore the audio features, which are equally essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and the audio features of a video for captioning. We use multi-head attention to reconcile the differing sequence lengths of the modalities involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
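The core fusion idea — using attention to align an audio stream and a visual stream that have different sequence lengths — can be illustrated with a minimal single-head cross-attention sketch. This is an illustrative simplification, not the paper's exact model: the dimensions, the single-head reduction, and the NumPy implementation are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    # query: (T_q, d) visual features; key_value: (T_kv, d) audio features.
    # The two sequence lengths T_q and T_kv may differ: attention resamples
    # the audio stream onto the visual time steps.
    scores = query @ key_value.T / np.sqrt(d_k)   # (T_q, T_kv) similarity
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ key_value                    # (T_q, d) fused output

rng = np.random.default_rng(0)
visual = rng.standard_normal((30, 64))  # 30 visual time steps, 64-d features (illustrative sizes)
audio = rng.standard_normal((50, 64))   # 50 audio time steps, 64-d features

fused = cross_attention(visual, audio, d_k=64)
print(fused.shape)  # (30, 64): one audio-informed vector per visual time step
```

A full multi-head version would split the feature dimension into several heads, apply learned query/key/value projections per head, and concatenate the results; the single-head form above shows only why attention handles mismatched sequence lengths naturally.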