Dual Encoding for Video Retrieval by Text.

Author information

Dong Jianfeng, Li Xirong, Xu Chaoxi, Yang Xun, Yang Gang, Wang Xun, Wang Meng

Publication information

IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4065-4080. doi: 10.1109/TPAMI.2021.3059295. Epub 2022 Jul 1.

DOI: 10.1109/TPAMI.2021.3059295
PMID: 33587696

Abstract

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space.
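
The pipeline the abstract outlines (encode each modality at several levels, project into a common space, score with a hybrid of latent and concept similarity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the two pooling levels, the linear projections, cosine similarity for the latent space, generalized Jaccard overlap for the concept space, the weight `alpha`, and all function and parameter names are hypothetical.

```python
import math


def mean_pool(seq):
    """Coarse level: average all per-step feature vectors into one vector."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


def local_max_pool(seq, window=2):
    """Finer level: average each sliding window, then max-pool over windows."""
    window = min(window, len(seq))  # guard for very short sequences
    windows = [mean_pool(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    return [max(col) for col in zip(*windows)]


def multi_level_encode(seq, window=2):
    """Concatenate the coarse and finer encodings (illustrative levels only)."""
    return mean_pool(seq) + local_max_pool(seq, window)


def project(vec, weights):
    """Linear projection: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity, used here as the assumed latent-space score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concept_similarity(p, q):
    """Generalized Jaccard overlap between two concept-probability vectors."""
    den = sum(max(a, b) for a, b in zip(p, q))
    return sum(min(a, b) for a, b in zip(p, q)) / den if den else 0.0


def hybrid_similarity(video_vec, query_vec, params, alpha=0.6):
    """Weighted sum of a latent-space score and a concept-space score."""
    latent = cosine(project(video_vec, params["video_latent"]),
                    project(query_vec, params["query_latent"]))
    v_concepts = [sigmoid(z) for z in project(video_vec, params["video_concept"])]
    q_concepts = [sigmoid(z) for z in project(query_vec, params["query_concept"])]
    return alpha * latent + (1 - alpha) * concept_similarity(v_concepts, q_concepts)


# Toy data: three "frames" and two "words", each a 2-dimensional feature vector.
demo_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
demo_words = [[0.5, 0.5], [1.0, 0.0]]
demo_identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
demo_params = {name: demo_identity for name in
               ("video_latent", "query_latent", "video_concept", "query_concept")}
demo_score = hybrid_similarity(multi_level_encode(demo_frames),
                               multi_level_encode(demo_words), demo_params)
print(demo_score)
```

In a trained system the projection matrices would be learned end to end rather than fixed to the identity as in this toy setup; the sketch only shows how the two scores could be combined at retrieval time.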


Similar articles

1
Dual Encoding for Video Retrieval by Text.
IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4065-4080. doi: 10.1109/TPAMI.2021.3059295. Epub 2022 Jul 1.
2
End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.
IEEE Trans Image Process. 2023;32:5017-5030. doi: 10.1109/TIP.2023.3275071. Epub 2023 Sep 8.
3
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.
IEEE Trans Image Process. 2023;32:5366-5378. doi: 10.1109/TIP.2023.3307969. Epub 2023 Oct 2.
4
Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.
IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.
5
Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval.
IEEE Trans Neural Netw Learn Syst. 2022 Oct 24;PP. doi: 10.1109/TNNLS.2022.3214208.
6
Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval.
IEEE Trans Image Process. 2021;30:2989-3004. doi: 10.1109/TIP.2020.3048680. Epub 2021 Feb 18.
7
Fine-Grained Video Retrieval With Scene Sketches.
IEEE Trans Image Process. 2023;32:3136-3149. doi: 10.1109/TIP.2023.3278474. Epub 2023 Jun 2.
8
The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval.
J Imaging. 2021 Apr 23;7(5):76. doi: 10.3390/jimaging7050076.
9
Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction.
IEEE Trans Image Process. 2020 Jan 17. doi: 10.1109/TIP.2020.2965987.
10
Dual-correlate optimized coarse-fine strategy for monocular laparoscopic videos feature matching via multilevel sequential coupling feature descriptor.
Comput Biol Med. 2024 Feb;169:107890. doi: 10.1016/j.compbiomed.2023.107890. Epub 2023 Dec 22.

Cited by

1
Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.
Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.
2
Deep Bayesian Quantization for Supervised Neuroimage Search.
Mach Learn Med Imaging. 2023 Oct;14349:396-406. doi: 10.1007/978-3-031-45676-3_40. Epub 2023 Oct 15.