Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.

Authors

Wu Hongzhou, Zhang Xiang, Tang Tao, Yang Canqun, Luo Zhigang

Affiliation

School of Computer, National University of Defense Technology, Changsha 410073, China.

Publication

Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.

Abstract

Weakly supervised temporal language grounding (TLG) aims to locate events in untrimmed videos based on natural language queries without temporal annotations, necessitating a deep understanding of semantic context across both video and text modalities. Existing methods often focus on simple correlations between query phrases and isolated video segments, neglecting the event-oriented semantic coherence and consistency required for accurate temporal grounding; this can produce misleading results driven by partial frame correlations. To address these limitations, we propose the Event-oriented State Alignment Network (ESAN), which constructs "start-event-end" semantic state sets for both textual and video data. ESAN employs relative entropy for cross-modal alignment through knowledge distillation from pre-trained large models, thereby enhancing semantic coherence within each modality and ensuring consistency across modalities. Our approach leverages vision-language models to extract static frame semantics and large language models to capture dynamic semantic changes, facilitating a more comprehensive understanding of events. Experiments on two benchmark datasets demonstrate that ESAN significantly outperforms existing methods. By reducing spurious high correlations and improving overall performance, our method effectively addresses the challenges faced by previous approaches. These advancements highlight the potential of ESAN to improve the precision and reliability of temporal language grounding tasks.
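
The abstract describes cross-modal alignment via relative entropy (KL divergence) under knowledge distillation from pre-trained large models. As a minimal illustration of that kind of objective, the PyTorch sketch below distills a teacher distribution over "start-event-end" semantic states into a student grounding network. The function name kl_alignment_loss, the temperature parameter, and the three-state layout are assumptions chosen for clarity, not ESAN's actual implementation.

import torch
import torch.nn.functional as F

def kl_alignment_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Relative entropy between a teacher distribution (e.g. from a
    pre-trained vision-language or large language model) and a student
    distribution over "start-event-end" semantic states.
    Hypothetical sketch; not the paper's code."""
    # Soften both distributions with a temperature, as is standard in distillation.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: 4 video segments, 3 semantic states (start / event / end).
student = torch.randn(4, 3)   # scores from the grounding network
teacher = torch.randn(4, 3)   # scores distilled from a pre-trained large model
loss = kl_alignment_loss(student, teacher, temperature=2.0)
print(loss.item())

In the method itself, the teacher scores would come from the pre-trained vision-language and large language models the abstract mentions rather than from random tensors, and the loss would be combined with the network's other training objectives.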

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/c8dfa5e8b017/entropy-26-00730-g001.jpg
