


AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description.

Publication Information

IEEE Trans Image Process. 2022;31:5559-5569. doi: 10.1109/TIP.2022.3195643. Epub 2022 Aug 26.

DOI: 10.1109/TIP.2022.3195643
PMID: 35994530
Abstract

Generating multi-sentence descriptions for video is considered one of the most complex tasks in computer vision and natural language understanding, owing to the intricate nature of video-text data. With recent advances in deep learning, multi-sentence video description has achieved impressive progress. However, learning rich temporal context representations of visual sequences and modelling long-term dependencies in natural language descriptions remain challenging problems. Towards this goal, we propose an Attentive Atrous Pyramid network and Memory Incorporated Transformer (AAP-MIT) for multi-sentence video description. AAP-MIT builds an effective representation of the visual scene by distilling the most informative and discriminative spatio-temporal features of the video at multiple granularities, and then generates highly summarized descriptions. Specifically, we construct AAP-MIT from three major components: i) a temporal pyramid network, which builds a multi-scale temporal feature hierarchy by convolving local features along the temporal axis; ii) a temporal correlation attention module, which learns the relations among temporal video segments; and iii) a memory incorporated transformer, which augments the language transformer with a new memory block to generate highly descriptive natural language sentences. Finally, extensive experiments on the ActivityNet Captions and YouCookII datasets demonstrate the substantial superiority of AAP-MIT over existing approaches.
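To illustrate the first component, the atrous (dilated) temporal pyramid can be sketched as applying the same temporal convolution at several dilation rates and concatenating the resulting multi-scale features. This is a minimal numpy sketch of the general technique only; the function names, kernel shapes, and dilation rates are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Apply a dilated (atrous) 1-D convolution along the time axis.

    x: (T, C) sequence of T frame features with C channels.
    w: (K, C) kernel with K taps, applied channel-wise.
    Returns (T, C); the input is zero-padded so sequence length is preserved.
    """
    T, _ = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            # Taps are spaced `dilation` steps apart: a larger dilation
            # covers a wider temporal context with the same kernel size.
            out[t] += w[k] * xp[t + k * dilation]
    return out

def atrous_temporal_pyramid(x, kernels, dilations):
    """Build a multi-scale temporal feature hierarchy by convolving the
    same sequence at increasing dilation rates and concatenating levels."""
    levels = [dilated_conv1d(x, w, d) for w, d in zip(kernels, dilations)]
    return np.concatenate(levels, axis=1)  # (T, C * len(dilations))
```

With dilations such as [1, 2, 4], each pyramid level summarizes progressively longer temporal context while the sequence length stays fixed, which is what lets later stages attend across scales.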

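The third component, the memory incorporated transformer, augments attention with a learned memory block. A common way to realize this general idea is to append trainable memory slots to the attention keys and values, so the decoder can attend to information stored outside the current sequence. The sketch below is a hedged numpy illustration of that pattern under assumed shapes; it is not the paper's architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(q, k, v, mem_k, mem_v):
    """Scaled dot-product attention whose keys/values are augmented with
    learned memory slots (mem_k, mem_v), shape (M, D) each.

    q: (Tq, D) queries; k, v: (T, D) sequence keys/values.
    Returns (Tq, D).
    """
    k_all = np.concatenate([k, mem_k], axis=0)    # (T + M, D)
    v_all = np.concatenate([v, mem_v], axis=0)    # (T + M, D)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])   # (Tq, T + M)
    return softmax(scores, axis=-1) @ v_all       # (Tq, D)
```

Because the memory slots participate in every attention step, they behave like a persistent store the model can read regardless of the current input window.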

Similar Articles

1. AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description.
   IEEE Trans Image Process. 2022;31:5559-5569. doi: 10.1109/TIP.2022.3195643. Epub 2022 Aug 26.
2. Learning Hierarchical Modular Networks for Video Captioning.
   IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.
3. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
   Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
4. Video captioning based on vision transformer and reinforcement learning.
   PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.
5. Temporal-based Swin Transformer network for workflow recognition of surgical video.
   Int J Comput Assist Radiol Surg. 2023 Jan;18(1):139-147. doi: 10.1007/s11548-022-02785-y. Epub 2022 Nov 4.
6. Long Short-Term Relation Transformer With Global Gating for Video Captioning.
   IEEE Trans Image Process. 2022;31:2726-2738. doi: 10.1109/TIP.2022.3158546. Epub 2022 Mar 29.
7. Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.
   IEEE Trans Image Process. 2021;30:5477-5489. doi: 10.1109/TIP.2021.3076556. Epub 2021 Jun 11.
8. P2T: Pyramid Pooling Transformer for Scene Understanding.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12760-12771. doi: 10.1109/TPAMI.2022.3202765. Epub 2023 Oct 3.
9. Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks.
   IEEE Trans Image Process. 2019 Aug;28(8):3860-3872. doi: 10.1109/TIP.2019.2902106. Epub 2019 Feb 27.
10. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
   Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.

Cited By

1. Volleyball training video classification description using the BiLSTM fusion attention mechanism.
   Heliyon. 2024 Jul 16;10(15):e34735. doi: 10.1016/j.heliyon.2024.e34735. eCollection 2024 Aug 15.