Video Summarization With Spatiotemporal Vision Transformer.

Publication information

IEEE Trans Image Process. 2023;32:3013-3026. doi: 10.1109/TIP.2023.3275069. Epub 2023 May 26.

DOI:10.1109/TIP.2023.3275069
PMID:37186532
Abstract

Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods aim to learn important content from the continuous frame information of human-created summaries. However, recent methods rarely consider both the inter-frame correlations among non-adjacent frames and the intra-frame attention that draws human viewers when representing frame importance. To address these issues, we propose a novel transformer-based method for video summarization named the spatiotemporal vision transformer (STVT). STVT is composed of three main components: the embedded sequence module, the temporal inter-frame attention (TIA) encoder, and the spatial intra-frame attention (SIA) encoder. The embedded sequence module generates the embedded sequence that represents the frames by fusing the frame embedding, index embedding, and segment class embedding. The TIA encoder learns the temporal inter-frame correlations among non-adjacent frames with a multi-head self-attention scheme. The SIA encoder then learns the spatial intra-frame attention of each frame. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By using inter-frame and intra-frame information simultaneously, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
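
The abstract describes the STVT pipeline only at a high level. The following plain-Python sketch, with no learned weights, illustrates three of its ideas under stated assumptions. First, it fuses the frame, index, and segment class embeddings by elementwise addition. This is an assumption borrowed from BERT-style input embeddings, and the paper may fuse them differently. Second, it applies multi-head scaled dot-product self-attention across all frames, which is what lets each frame attend to non-adjacent frames. Third, it uses a "multi-frame" loss, assumed here to be binary cross-entropy averaged over every frame in the sequence. All function names are hypothetical and are not taken from the STVT repository. Learned query/key/value projections and the SIA encoder (attention within each frame) are omitted.

```python
import math


def add_vectors(*vecs):
    """Elementwise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def embedded_sequence(frame_emb, index_emb, segment_cls_emb):
    """Hypothetical fusion (assumption): add each frame embedding to its
    index (position) embedding and a shared segment class embedding."""
    return [add_vectors(f, p, segment_cls_emb) for f, p in zip(frame_emb, index_emb)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(seq, num_heads):
    """Scaled dot-product self-attention over all frames, split into heads.
    Simplification (assumption): queries, keys and values are the inputs
    themselves, with no learned projection matrices. Every frame attends
    to every other frame, whether adjacent or not."""
    dim = len(seq[0])
    assert dim % num_heads == 0, "embedding dim must be divisible by num_heads"
    head_dim = dim // num_heads
    out = [[0.0] * dim for _ in seq]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        chunk = [x[lo:hi] for x in seq]  # this head's slice of every frame
        for i, q in enumerate(chunk):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                      for k in chunk]
            weights = softmax(scores)  # attention of frame i over all frames
            for d in range(head_dim):
                out[i][lo + d] = sum(w * v[d] for w, v in zip(weights, chunk))
    return out


def multi_frame_bce_loss(pred_scores, target_labels, eps=1e-7):
    """Hypothetical multi-frame loss (assumption): binary cross-entropy
    averaged over all frames, so every frame contributes to training."""
    total = 0.0
    for p, y in zip(pred_scores, target_labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(pred_scores)


if __name__ == "__main__":
    # Toy inputs: 3 frames with made-up 4-dimensional features.
    frames = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
    index_emb = [[0.1 * i] * 4 for i in range(len(frames))]
    segment_cls = [0.0, 0.0, 0.1, 0.1]
    seq = embedded_sequence(frames, index_emb, segment_cls)
    attended = multi_head_self_attention(seq, num_heads=2)
    print(len(attended), len(attended[0]))  # prints: 3 4
    print(multi_frame_bce_loss([0.5, 0.5], [1, 0]))  # log(2), about 0.693
```

Because each head's output row is a softmax-weighted average of all frames, the first frame's representation can draw directly on the third frame without passing through the second. That property is the motivation the abstract gives for using self-attention to capture correlations among non-adjacent frames.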

Similar articles

1. Video Summarization With Spatiotemporal Vision Transformer.
   IEEE Trans Image Process. 2023;32:3013-3026. doi: 10.1109/TIP.2023.3275069. Epub 2023 May 26.
2. Interp-SUM: Unsupervised Video Summarization with Piecewise Linear Interpolation.
   Sensors (Basel). 2021 Jul 2;21(13):4562. doi: 10.3390/s21134562.
3. An Effective Video Transformer With Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition.
   IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2496-2509. doi: 10.1109/TNNLS.2022.3190367. Epub 2024 Feb 5.
4. Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization.
   J Imaging. 2024 Sep 14;10(9):229. doi: 10.3390/jimaging10090229.
5. Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism.
   Heliyon. 2024 Feb 18;10(4):e26162. doi: 10.1016/j.heliyon.2024.e26162. eCollection 2024 Feb 29.
6. Unsupervised Low-Light Video Enhancement With Spatial-Temporal Co-Attention Transformer.
   IEEE Trans Image Process. 2023;32:4701-4715. doi: 10.1109/TIP.2023.3301332. Epub 2023 Aug 16.
7. Unsupervised Video Summarization Based on Deep Reinforcement Learning with Interpolation.
   Sensors (Basel). 2023 Mar 23;23(7):3384. doi: 10.3390/s23073384.
8. AudioVisual Video Summarization.
   IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):5181-5188. doi: 10.1109/TNNLS.2021.3119969. Epub 2023 Aug 4.
9. Structure and Sequence Aligned Code Summarization with Prefix and Suffix Balanced Strategy.
   Entropy (Basel). 2023 Mar 26;25(4):570. doi: 10.3390/e25040570.
10. DSNet: A Flexible Detect-to-Summarize Network for Video Summarization.
   IEEE Trans Image Process. 2021;30:948-962. doi: 10.1109/TIP.2020.3039886. Epub 2020 Dec 8.