Video captioning based on vision transformer and reinforcement learning

Authors

Zhao Hong, Chen Zhiwen, Guo Lan, Han Zeyu

Affiliations

School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, China.

Network & Information Center, Lanzhou University of Technology, Lanzhou, Gansu, China.

Publication Information

PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.

DOI: 10.7717/peerj-cs.916
PMID: 35494808
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9044334/
Abstract

Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D by 2.9%, 1.4%, 0.9% and 4.8%, respectively.
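The pipeline the abstract describes (precomputed CNN frame features → ViT-style Transformer encoder block → LSTM decoder) can be outlined in a minimal PyTorch-style sketch. All layer sizes, the number of encoder layers, and the mean-pooled state initialization below are illustrative assumptions, not the authors' actual configuration; the reinforcement-learning fine-tuning stage is omitted for brevity.

```python
# Sketch of the described architecture: CNN features -> Transformer
# encoder (ViT-style global encoding) -> LSTM caption decoder.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # map CNN features to model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, frames, feat_dim), e.g. from ResNet-152 / ResNeXt-101
        memory = self.encoder(self.proj(feats))   # globally encoded frame features
        # initialize the LSTM hidden state from the mean-pooled encoding
        h0 = memory.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec_out)                  # (batch, seq, vocab) logits

model = CaptionModel()
feats = torch.randn(2, 16, 2048)                  # 2 videos, 16 frames each
caps = torch.randint(0, 10000, (2, 12))           # teacher-forcing token ids
logits = model(feats, caps)                       # shape (2, 12, 10000)
```

In the RL stage the paper mentions, a sequence-level reward (typically CIDEr) on sampled captions would replace the cross-entropy objective used with the logits above.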


Similar Articles

1. Video captioning based on vision transformer and reinforcement learning.
PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.
2. UAT: Universal Attention Transformer for Video Captioning.
Sensors (Basel). 2022 Jun 25;22(13):4817. doi: 10.3390/s22134817.
3. Video Captioning Using Global-Local Representation.
IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/tcsvt.2022.3177320. Epub 2022 May 23.
4. Research on Video Captioning Based on Multifeature Fusion.
Comput Intell Neurosci. 2022 Apr 28;2022:1204909. doi: 10.1155/2022/1204909. eCollection 2022.
5. Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning.
IEEE Trans Pattern Anal Mach Intell. 2020 Dec;42(12):3088-3101. doi: 10.1109/TPAMI.2019.2920899. Epub 2020 Nov 3.
6. Semantic guidance network for video captioning.
Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.
7. Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.
IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
8. Long Short-Term Relation Transformer With Global Gating for Video Captioning.
IEEE Trans Image Process. 2022;31:2726-2738. doi: 10.1109/TIP.2022.3158546. Epub 2022 Mar 29.
9. SibNet: Sibling Convolutional Encoder for Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2021 Sep;43(9):3259-3272. doi: 10.1109/TPAMI.2019.2940007. Epub 2021 Aug 4.
10. Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning.
Sensors (Basel). 2022 Feb 13;22(4):1429. doi: 10.3390/s22041429.

References Cited in This Article

1. Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.
IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
2. Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning.
IEEE Trans Pattern Anal Mach Intell. 2020 Dec;42(12):3088-3101. doi: 10.1109/TPAMI.2019.2920899. Epub 2020 Nov 3.
3. Video Captioning by Adversarial LSTM.
IEEE Trans Image Process. 2018 Jul 12. doi: 10.1109/TIP.2018.2855422.