

Research on Video Captioning Based on Multifeature Fusion.

Affiliation

School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, China.

Publication

Comput Intell Neurosci. 2022 Apr 28;2022:1204909. doi: 10.1155/2022/1204909. eCollection 2022.

DOI: 10.1155/2022/1204909
PMID: 35528356
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9071958/
Abstract

Existing video captioning models attend to incomplete information and generate insufficiently accurate descriptive text. To address these problems, a video captioning model that fuses image, audio, and motion optical-flow features is proposed. Models pretrained on a variety of large-scale datasets are used to extract video-frame features, motion information, audio features, and video sequence features. An embedding-layer structure based on the self-attention mechanism is designed to embed each single-modality feature and learn its parameters. Two schemes, joint representation and cooperative representation, are then used to fuse the multimodal feature vectors output by the embedding layer, so that the model can attend to different targets in the video and their interactions, which effectively improves the performance of the video captioning model. Experiments are carried out on the large-scale MSR-VTT and LSMDC datasets. On the MSR-VTT benchmark, the model scores 0.443, 0.327, 0.619, and 0.521 under the BLEU4, METEOR, ROUGEL, and CIDEr metrics, respectively. The results show that the proposed method effectively improves video captioning performance, with evaluation metrics improved over the comparison models.
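For readers unfamiliar with the mechanism, here is a minimal pure-Python sketch of the scaled dot-product self-attention that underlies the embedding layer described above. It is an illustration, not the authors' implementation: identity projections stand in for the learned query/key/value weights, and the three toy tokens are hypothetical modality embeddings rather than the paper's extracted features.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Each token attends to every token (itself included); identity
    projections replace the learned Q/K/V matrices for simplicity.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this query token to every key token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # Attended output: convex combination of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

# Hypothetical 3-dim embeddings for three modality tokens.
tokens = [[0.2, 0.4, 0.1], [0.5, 0.1, 0.2], [0.1, 0.3, 0.6]]
attended = self_attention(tokens)
```

Because the attention weights are a softmax, each output component is a convex combination of the corresponding input components, so the attended vectors stay within the range of the inputs.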

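The two fusion schemes the abstract names can likewise be sketched. This is a simplified reading, assuming joint representation means concatenating the modality embeddings and cooperative representation means a softmax-gated weighted blend of same-dimensional embeddings; the toy vectors and gate scores are illustrative assumptions, not the authors' parameters.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint_fusion(features):
    # Joint representation: concatenate all modality vectors into one.
    return [v for feat in features for v in feat]

def cooperative_fusion(features, gate_scores):
    # Cooperative representation: softmax-normalised gate weights blend
    # same-dimensional modality vectors elementwise.
    weights = softmax(gate_scores)
    dim = len(features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(dim)]

# Hypothetical 4-dim embeddings for image, audio, and motion (optical flow).
image  = [0.2, 0.4, 0.1, 0.3]
audio  = [0.5, 0.1, 0.2, 0.2]
motion = [0.1, 0.3, 0.6, 0.0]

joint = joint_fusion([image, audio, motion])   # 12-dim concatenated vector
coop  = cooperative_fusion([image, audio, motion],
                           gate_scores=[1.0, 0.5, 0.2])  # 4-dim blended vector
```

Joint fusion preserves every modality's dimensions at the cost of a wider downstream layer; cooperative fusion keeps the dimensionality fixed and lets the gates decide how much each modality contributes.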

Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f3f/9071958/8ccb95c49f1d/CIN2022-1204909.009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f3f/9071958/1a45e6a9c23b/CIN2022-1204909.010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f3f/9071958/e238949d9a4b/CIN2022-1204909.011.jpg

Similar articles

1. Research on Video Captioning Based on Multifeature Fusion. Comput Intell Neurosci. 2022 Apr 28;2022:1204909. doi: 10.1155/2022/1204909. eCollection 2022.
2. Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation. IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
3. Video captioning based on vision transformer and reinforcement learning. PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.
4. Center-enhanced video captioning model with multimodal semantic alignment. Neural Netw. 2024 Dec;180:106744. doi: 10.1016/j.neunet.2024.106744. Epub 2024 Sep 18.
5. Visual Commonsense-Aware Representation Network for Video Captioning. IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.
6. SibNet: Sibling Convolutional Encoder for Video Captioning. IEEE Trans Pattern Anal Mach Intell. 2021 Sep;43(9):3259-3272. doi: 10.1109/TPAMI.2019.2940007. Epub 2021 Aug 4.
7. UAT: Universal Attention Transformer for Video Captioning. Sensors (Basel). 2022 Jun 25;22(13):4817. doi: 10.3390/s22134817.
8. Aligning Source Visual and Target Language Domains for Unpaired Video Captioning. IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.
9. Semantic guidance network for video captioning. Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.
10. A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling. Front Robot AI. 2020 Sep 30;7:475767. doi: 10.3389/frobt.2020.475767. eCollection 2020.

Cited by

1. EdgeVidCap: A Channel-Spatial Dual-Branch Lightweight Video Captioning Model for IoT Edge Cameras. Sensors (Basel). 2025 Aug 8;25(16):4897. doi: 10.3390/s25164897.
2. Research on Online Education Resources Recommendation Based on Deep Learning. Comput Intell Neurosci. 2022 Sep 9;2022:3674271. doi: 10.1155/2022/3674271. eCollection 2022.

References

1. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans Neural Netw Learn Syst. 2021 Feb;32(2):604-624. doi: 10.1109/TNNLS.2020.2979670. Epub 2021 Feb 4.
2. Hierarchical LSTMs with Adaptive Attention for Visual Captioning. IEEE Trans Pattern Anal Mach Intell. 2020 May;42(5):1112-1131. doi: 10.1109/TPAMI.2019.2894139. Epub 2019 Jan 21.
3. Long short-term memory. Neural Comput. 1997 Nov 15;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735.