
Learning Hierarchical Modular Networks for Video Captioning.

Authors

Li Guorong, Ye Hanhua, Qi Yuankai, Wang Shuhui, Qing Laiyun, Huang Qingming, Yang Ming-Hsuan

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.

DOI: 10.1109/TPAMI.2023.3327677
PMID: 37878438
Abstract

Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module to embed corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against the state-of-the-art models on three widely used benchmark datasets: Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and Video-and-TEXt (VATEX).

