Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.

Authors

Wu Hongzhou, Zhang Xiang, Tang Tao, Yang Canqun, Luo Zhigang

Affiliation

School of Computer, National University of Defense Technology, Changsha 410073, China.

Publication

Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.

Abstract

Weakly supervised temporal language grounding (TLG) aims to locate events in untrimmed videos based on natural language queries without temporal annotations, necessitating a deep understanding of semantic context across both video and text modalities. Existing methods often focus on simple correlations between query phrases and isolated video segments, neglecting the event-oriented semantic coherence and consistency required for accurate temporal grounding; this can produce misleading results driven by partial frame correlations. To address these limitations, we propose the Event-oriented State Alignment Network (ESAN), which constructs "start-event-end" semantic state sets for both textual and video data. ESAN employs relative entropy for cross-modal alignment through knowledge distillation from pre-trained large models, thereby enhancing semantic coherence within each modality and ensuring consistency across modalities. Our approach leverages vision-language models to extract static frame semantics and large language models to capture dynamic semantic changes, facilitating a more comprehensive understanding of events. Experiments on two benchmark datasets demonstrate that ESAN significantly outperforms existing methods. By reducing spurious high correlations and improving overall performance, our method effectively addresses the challenges faced by previous approaches. These advancements highlight the potential of ESAN to improve the precision and reliability of temporal language grounding tasks.
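
The abstract describes cross-modal alignment via relative entropy (KL divergence) under knowledge distillation from pre-trained large models. As a minimal illustration of that kind of objective, the PyTorch sketch below distills a teacher distribution over "start-event-end" semantic states into a student grounding network. The function name kl_alignment_loss, the temperature parameter, and the three-state layout are assumptions chosen for clarity, not ESAN's actual implementation.

import torch
import torch.nn.functional as F

def kl_alignment_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Relative entropy between a teacher distribution (e.g. from a
    pre-trained vision-language or large language model) and a student
    distribution over "start-event-end" semantic states.
    Hypothetical sketch; not the paper's code."""
    # Soften both distributions with a temperature, as is standard in distillation.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: 4 video segments, 3 semantic states (start / event / end).
student = torch.randn(4, 3)   # scores from the grounding network
teacher = torch.randn(4, 3)   # scores distilled from a pre-trained large model
loss = kl_alignment_loss(student, teacher, temperature=2.0)
print(loss.item())

In the method itself, the teacher scores would come from the pre-trained vision-language and large language models the abstract mentions rather than from random tensors, and the loss would be combined with the network's other training objectives.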

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/c8dfa5e8b017/entropy-26-00730-g001.jpg
