
Fusion of Multi-Modal Features to Enhance Dense Video Caption.

Affiliations

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.

Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic University, Macau 999078, China.

Publication

Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.

DOI: 10.3390/s23125565
PMID: 37420732
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10304565/
Abstract

Dense video captioning aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and audio features of a video for captioning. We use multi-head attention to handle the differing sequence lengths of the modalities involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
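The abstract's central mechanism is cross-modal attention: because visual and audio streams are sampled on different time grids, multi-head attention is used to bring one modality onto the other's time steps before fusion. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the feature dimensions, random projection matrices, and simple concatenation fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, key_value, n_heads=4, d_head=16, seed=0):
    """Attend one modality over another of a different sequence length.

    query:     (T_q, d)  e.g. visual features, one per sampled frame
    key_value: (T_kv, d) e.g. audio features on a finer time grid
    Returns (T_q, n_heads * d_head): key_value context re-sampled onto the
    query's time steps, so T_q and T_kv need not match.
    """
    rng = np.random.default_rng(seed)
    d = query.shape[1]
    heads = []
    for _ in range(n_heads):
        # Random (untrained) projections stand in for learned weights.
        W_q = rng.standard_normal((d, d_head)) / np.sqrt(d)
        W_k = rng.standard_normal((d, d_head)) / np.sqrt(d)
        W_v = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Q, K, V = query @ W_q, key_value @ W_k, key_value @ W_v
        attn = softmax(Q @ K.T / np.sqrt(d_head))  # (T_q, T_kv)
        heads.append(attn @ V)                     # (T_q, d_head)
    return np.concatenate(heads, axis=1)

# 20 visual steps vs. 50 audio steps: attention bridges the length gap.
rng = np.random.default_rng(1)
visual = rng.standard_normal((20, 64))
audio = rng.standard_normal((50, 64))
audio_context = multi_head_cross_attention(visual, audio)  # (20, 64)
fused = np.concatenate([visual, audio_context], axis=1)    # (20, 128)
print(fused.shape)
```

In the paper's design the fused features would then feed the Common Pool and the LSTM decoder; here the fusion step alone shows why the two streams need not share a sequence length.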

Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ff90/10304565/885c1f488270/sensors-23-05565-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ff90/10304565/30bdf313a1b5/sensors-23-05565-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ff90/10304565/98d2c439fa25/sensors-23-05565-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ff90/10304565/fe2ad5fb344f/sensors-23-05565-g004.jpg

Similar Articles

1. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
2. Event-centric multi-modal fusion method for dense video captioning.
Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.
3. Class-dependent and cross-modal memory network considering sentimental features for video-based captioning.
Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.
4. Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph.
Complex Intell Systems. 2023 Feb 24:1-18. doi: 10.1007/s40747-023-00998-5.
5. Chinese Image Caption Generation via Visual Attention and Topic Modeling.
IEEE Trans Cybern. 2022 Feb;52(2):1247-1257. doi: 10.1109/TCYB.2020.2997034. Epub 2022 Feb 16.
6. Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.
7. Center-enhanced video captioning model with multimodal semantic alignment.
Neural Netw. 2024 Dec;180:106744. doi: 10.1016/j.neunet.2024.106744. Epub 2024 Sep 18.
8. CAM-RNN: Co-Attention Model Based RNN for Video Captioning.
IEEE Trans Image Process. 2019 Nov;28(11):5552-5565. doi: 10.1109/TIP.2019.2916757. Epub 2019 May 20.
9. Syntax Customized Video Captioning by Imitating Exemplar Sentences.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):10209-10221. doi: 10.1109/TPAMI.2021.3131618. Epub 2022 Nov 7.
10. A Multilevel Transfer Learning Technique and LSTM Framework for Generating Medical Captions for Limited CT and DBT Images.
J Digit Imaging. 2022 Jun;35(3):564-580. doi: 10.1007/s10278-021-00567-7. Epub 2022 Feb 25.
