
A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling.

Authors

Chen Haoran, Lin Ke, Maye Alexander, Li Jianmin, Hu Xiaolin

Affiliations

The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China.

Samsung Research China, Beijing, China.

Publication

Front Robot AI. 2020 Sep 30;7:475767. doi: 10.3389/frobt.2020.475767. eCollection 2020.

DOI:10.3389/frobt.2020.475767
PMID:33501293
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7805957/
Abstract

Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but during training and inference, different strategies are applied to guide word generation, leading to poor performance. Third, current video captioning models are prone to generate relatively short captions that express video contents inappropriately. Toward resolving these three problems, we suggest three corresponding improvements. First of all, we propose a metric to compare the quality of semantic features, and utilize appropriate features as input for a semantic detection network (SDN) with adequate complexity in order to generate meaningful semantic features for videos. Then, we apply a scheduled sampling strategy that gradually transfers the training phase from a teacher-guided manner toward a more self-teaching manner. Finally, the ordinary logarithm probability loss function is leveraged by sentence length so that the inclination of generating short sentences is alleviated. Our model achieves better results than previous models on the YouTube2Text dataset and is competitive with the previous best model on the MSR-VTT dataset.
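Two of the improvements described in the abstract — the scheduled sampling schedule and the length-leveraged log probability loss — can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the inverse-sigmoid decay (the schedule popularized by Bengio et al.'s scheduled sampling work) and the simple per-token length normalization are assumptions, and all function names here are hypothetical.

```python
import math
import random

def teacher_forcing_prob(step, k=2000.0):
    """Probability of feeding the ground-truth token at a given training
    step. Inverse-sigmoid decay: close to 1 early in training (fully
    teacher-guided), approaching 0 later (self-teaching)."""
    return k / (k + math.exp(step / k))

def choose_decoder_inputs(gt_tokens, predicted_tokens, step, rng=random):
    """Scheduled sampling: at each time step, feed the ground-truth token
    with probability eps, otherwise feed back the model's own previous
    prediction from the last decoding step."""
    eps = teacher_forcing_prob(step)
    return [gt if rng.random() < eps else pred
            for gt, pred in zip(gt_tokens, predicted_tokens)]

def length_leveraged_nll(token_logprobs):
    """Negative log-likelihood divided by sentence length, so captions
    that are longer but equally probable per token are not penalized
    relative to short ones."""
    return -sum(token_logprobs) / len(token_logprobs)
```

Under a plain summed NLL, a 5-token caption with per-token log-probability -1.0 scores 5.0 while a 10-token caption of the same per-token quality scores 10.0, biasing decoding toward short outputs; dividing by length gives both a loss of 1.0.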


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf56/7805957/75837727288d/frobt-07-475767-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf56/7805957/78714c0ef2aa/frobt-07-475767-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf56/7805957/f91dd6398237/frobt-07-475767-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf56/7805957/215955fb2a34/frobt-07-475767-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf56/7805957/01e930c05462/frobt-07-475767-g0005.jpg

Similar Articles

1
A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling.
Front Robot AI. 2020 Sep 30;7:475767. doi: 10.3389/frobt.2020.475767. eCollection 2020.
2
SibNet: Sibling Convolutional Encoder for Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2021 Sep;43(9):3259-3272. doi: 10.1109/TPAMI.2019.2940007. Epub 2021 Aug 4.
3
Syntax Customized Video Captioning by Imitating Exemplar Sentences.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):10209-10221. doi: 10.1109/TPAMI.2021.3131618. Epub 2022 Nov 7.
4
Center-enhanced video captioning model with multimodal semantic alignment.
Neural Netw. 2024 Dec;180:106744. doi: 10.1016/j.neunet.2024.106744. Epub 2024 Sep 18.
5
Learning Hierarchical Modular Networks for Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.
6
Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.
Med Image Anal. 2022 Nov;82:102630. doi: 10.1016/j.media.2022.102630. Epub 2022 Sep 17.
7
Semantic guidance network for video captioning.
Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.
8
CAM-RNN: Co-Attention Model Based RNN for Video Captioning.
IEEE Trans Image Process. 2019 Nov;28(11):5552-5565. doi: 10.1109/TIP.2019.2916757. Epub 2019 May 20.
9
Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation.
IEEE Trans Image Process. 2021;30:1180-1192. doi: 10.1109/TIP.2020.3042086. Epub 2020 Dec 17.
10
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.

Cited By

1
Video Captioning Using Global-Local Representation.
IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/tcsvt.2022.3177320. Epub 2022 May 23.

References

1
Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.
IEEE Trans Pattern Anal Mach Intell. 2017 Apr;39(4):677-691. doi: 10.1109/TPAMI.2016.2599174. Epub 2016 Sep 1.
2
Long short-term memory.
Neural Comput. 1997 Nov 15;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735.