
Event-centric multi-modal fusion method for dense video captioning.

Affiliations

Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, China.

Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, China; Key Laboratory of Computer Vision and System, Ministry of Education, China.

Publication Information

Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.

DOI: 10.1016/j.neunet.2021.11.017
PMID: 34852298
Abstract

Dense video captioning aims to automatically describe several events that occur in a given video, which most state-of-the-art models accomplish by locating and describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches only encode visual features in the event location phase and they neglect the relationships between events, which may degrade the consistency of the description in the identical video. Thus, in the present study, we attempted to exploit visual-audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the major limitation of not fully utilizing multimodal information in the description process, we developed an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) to capture the relationships between events and effectively fuse multi-modal information. We conducted comprehensive experiments to evaluate the performance of EMVC based on the benchmark ActivityNet Caption and YouCook2 data sets. The experimental results showed that our model achieved impressive performance compared with state-of-the-art methods.
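The abstract only sketches the attention-gating mechanism that fuses visual and audio event features. For illustration, below is a minimal PyTorch sketch of one way such a gated fusion could work; the module name, dimensions, and the exact gating formula are assumptions made here for clarity, not the authors' EMVC implementation (see the paper for the actual architecture).

```python
# Hypothetical sketch of an attention-gating multi-modal fusion step,
# loosely following the abstract's description. Names, dimensions, and
# the gating formula are illustrative assumptions only.
import torch
import torch.nn as nn


class AttentionGatedFusion(nn.Module):
    """Fuse visual and audio event features with a learned gate."""

    def __init__(self, visual_dim: int, audio_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # The gate decides, per feature channel, how much weight each
        # modality receives in the fused event representation.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, events, visual_dim); audio: (batch, events, audio_dim)
        v = torch.tanh(self.visual_proj(visual))
        a = torch.tanh(self.audio_proj(audio))
        g = self.gate(torch.cat([v, a], dim=-1))  # (batch, events, hidden_dim)
        return g * v + (1.0 - g) * a              # gated convex combination


# Example: fuse per-event features for 2 videos with 5 event proposals each.
if __name__ == "__main__":
    fusion = AttentionGatedFusion(visual_dim=1024, audio_dim=128, hidden_dim=512)
    visual_feats = torch.randn(2, 5, 1024)
    audio_feats = torch.randn(2, 5, 128)
    fused = fusion(visual_feats, audio_feats)
    print(fused.shape)  # torch.Size([2, 5, 512])
```

The sigmoid gate yields a channel-wise convex combination of the two modalities, which is one common way to "dynamically fuse and regulate" multi-modal information as the abstract describes; the published model may differ in structure and detail.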

Similar Articles

1. Event-centric multi-modal fusion method for dense video captioning.
   Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.
2. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
   Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
3. Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph.
   Complex Intell Systems. 2023 Feb 24:1-18. doi: 10.1007/s40747-023-00998-5.
4. Center-enhanced video captioning model with multimodal semantic alignment.
   Neural Netw. 2024 Dec;180:106744. doi: 10.1016/j.neunet.2024.106744. Epub 2024 Sep 18.
5. Class-dependent and cross-modal memory network considering sentimental features for video-based captioning.
   Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.
6. Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.
   Med Image Anal. 2022 Nov;82:102630. doi: 10.1016/j.media.2022.102630. Epub 2022 Sep 17.
7. Cross-Modal Graph With Meta Concepts for Video Captioning.
   IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.
8. Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning.
   IEEE Trans Pattern Anal Mach Intell. 2020 Dec;42(12):3088-3101. doi: 10.1109/TPAMI.2019.2920899. Epub 2020 Nov 3.
9. Corrigendum to "Event-centric Multi-modal Fusion Method for Dense Video Captioning" [Neural Networks 146 (2022) 120-129].
   Neural Netw. 2022 Aug;152:527. doi: 10.1016/j.neunet.2022.05.011. Epub 2022 Jun 2.
10. Syntax Customized Video Captioning by Imitating Exemplar Sentences.
    IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):10209-10221. doi: 10.1109/TPAMI.2021.3131618. Epub 2022 Nov 7.

Cited By

1. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
   Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.