Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, China.
Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, China; Key Laboratory of Computer Vision and System, Ministry of Education, China.
Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.
Dense video captioning aims to automatically describe several events that occur in a given video, which most state-of-the-art models accomplish by locating and describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches encode only visual features in the event localization phase and neglect the relationships between events, which may degrade the consistency of the descriptions within the same video. Thus, in the present study, we exploit visual-audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the major limitation of not fully utilizing multi-modal information in the description process, we developed an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) to capture the relationships between events and effectively fuse multi-modal information. We conducted comprehensive experiments to evaluate the performance of EMVC on the benchmark ActivityNet Caption and YouCook2 data sets, and the experimental results show that our model achieves impressive performance compared with state-of-the-art methods.
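To illustrate the kind of attention-gating fusion the abstract describes, the following is a minimal sketch, not the authors' implementation: a learned gate that weighs projected visual features against audio features per time step. All names (AttentionGatedFusion, d_model) and the choice of a simple sigmoid gate with a convex combination are assumptions for illustration only.

```python
# Hypothetical sketch of attention-gated visual-audio fusion (not the EMVC code).
import torch
import torch.nn as nn

class AttentionGatedFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # A learned gate decides, per time step and channel, how much of each
        # modality to keep; inputs are assumed already projected to d_model.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, time, d_model)
        g = self.gate(torch.cat([visual, audio], dim=-1))
        # Convex combination: the gate regulates the contribution of each modality.
        return g * visual + (1.0 - g) * audio

# Usage: fuse per-segment visual and audio features before event proposal/captioning.
fusion = AttentionGatedFusion(d_model=512)
v = torch.randn(2, 100, 512)   # e.g., frame-level visual features
a = torch.randn(2, 100, 512)   # e.g., audio features aligned to the same segments
fused = fusion(v, a)           # (2, 100, 512)
```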