

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.

Authors

Zhang Zongmeng, Han Xianjing, Song Xuemeng, Yan Yan, Nie Liqiang

Publication

IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.

DOI: 10.1109/TIP.2021.3113791
PMID: 34559652
Abstract

This paper focuses on tackling the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. However, it is non-trivial since it requires not only the comprehensive understanding of the video and sentence query, but also the accurate semantic correspondence capture between them. Existing efforts are mainly centered on exploring the sequential relation among video clips and query words to reason the video and sentence query, neglecting the other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among the query words). Towards this end, in this work, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate the understanding and semantic correspondence capture of the video and sentence query. In addition, we devise an adaptive context-aware localization method, where the context information is taken into the candidate moments and the multi-scale fully connected layers are designed to rank and adjust the boundary of the generated coarse candidate moments with different lengths. Extensive experiments on Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
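The abstract describes a joint graph over video-clip nodes and query-word nodes, with intra-modal edges (e.g., semantic similarity among clips, syntactic dependency among words) and inter-modal clip-word edges, updated by graph convolution. The following is a minimal illustrative sketch of that idea, not the authors' implementation: all sizes, edge weights, and the single-layer update rule ReLU(ÂHW) are toy assumptions.

```python
import numpy as np

# Illustrative sketch (not the MIGCN code): one graph-convolution step over a
# joint graph of video-clip nodes and query-word nodes. The adjacency mixes
# intra-modal blocks (clip-clip, word-word) and inter-modal blocks (clip-word).
rng = np.random.default_rng(0)
n_clips, n_words, d = 4, 3, 8                    # toy sizes
H = rng.standard_normal((n_clips + n_words, d))  # stacked node features

# Block adjacency: intra-modal blocks on the diagonal, inter-modal off it.
# In the paper these would be learned/derived relations; here they are all-ones.
A_cc = np.ones((n_clips, n_clips))   # e.g. semantic similarity among clips
A_ww = np.ones((n_words, n_words))   # e.g. syntactic dependency among words
A_cw = np.ones((n_clips, n_words))   # clip-word interactions
A = np.block([[A_cc, A_cw],
              [A_cw.T, A_ww]])
A += np.eye(A.shape[0])              # self-loops

# Symmetric normalization: Â = D^{-1/2} A D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

W = rng.standard_normal((d, d)) * 0.1            # layer weight
H_next = np.maximum(A_norm @ H @ W, 0.0)         # ReLU(Â H W)
print(H_next.shape)                              # (7, 8)
```

After such message passing, each clip node carries context from related clips and from the query words, which is the property the localization head then exploits when scoring candidate moments.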


Similar Articles

1
Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.
2
Text-Based Localization of Moments in a Video Corpus.
IEEE Trans Image Process. 2021;30:8886-8899. doi: 10.1109/TIP.2021.3120038. Epub 2021 Oct 28.
3
Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction.
IEEE Trans Image Process. 2020 Jan 17. doi: 10.1109/TIP.2020.2965987.
4
Interaction-Integrated Network for Natural Language Moment Localization.
IEEE Trans Image Process. 2021;30:2538-2548. doi: 10.1109/TIP.2021.3052086. Epub 2021 Feb 3.
5
MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval.
IEEE Trans Image Process. 2021;30:5589-5599. doi: 10.1109/TIP.2021.3086591. Epub 2021 Jun 16.
6
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
7
Multi-Scale 2D Temporal Adjacency Networks for Moment Localization With Natural Language.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9073-9087. doi: 10.1109/TPAMI.2021.3120745. Epub 2022 Nov 7.
8
Graph-Based Multi-Interaction Network for Video Question Answering.
IEEE Trans Image Process. 2021;30:2758-2770. doi: 10.1109/TIP.2021.3051756. Epub 2021 Feb 12.
9
SDN: Semantic Decoupling Network for Temporal Language Grounding.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6598-6612. doi: 10.1109/TNNLS.2022.3211850. Epub 2024 May 2.
10
M2DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization.
IEEE Trans Neural Netw Learn Syst. 2024 Aug;35(8):11448-11462. doi: 10.1109/TNNLS.2023.3261927. Epub 2024 Aug 5.