
Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.

Authors

Wu Hongzhou, Zhang Xiang, Tang Tao, Yang Canqun, Luo Zhigang

Affiliations

School of Computer, National University of Defense Technology, Changsha 410073, China.

Publication

Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.

DOI: 10.3390/e26090730
PMID: 39330065
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11431080/
Abstract

Weakly supervised temporal language grounding (TLG) aims to locate events in untrimmed videos based on natural language queries without temporal annotations, necessitating a deep understanding of semantic context across both video and text modalities. Existing methods often focus on simple correlations between query phrases and isolated video segments, neglecting the event-oriented semantic coherence and consistency required for accurate temporal grounding. This can lead to misleading results due to partial frame correlations. To address these limitations, we propose the Event-oriented State Alignment Network (ESAN), which constructs "start-event-end" semantic state sets for both textual and video data. ESAN employs relative entropy for cross-modal alignment through knowledge distillation from pre-trained large models, thereby enhancing semantic coherence within each modality and ensuring consistency across modalities. Our approach leverages vision-language models to extract static frame semantics and large language models to capture dynamic semantic changes, facilitating a more comprehensive understanding of events. Experiments conducted on two benchmark datasets demonstrate that ESAN significantly outperforms existing methods. By reducing false high correlations and improving the overall performance, our method effectively addresses the challenges posed by previous approaches. These advancements highlight the potential of ESAN to improve the precision and reliability of temporal language grounding tasks.
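The alignment signal the abstract describes is relative entropy (KL divergence) between the per-modality "start-event-end" semantic state distributions, with the text-side states distilled from large language models and the video-side states from vision-language models. A minimal sketch of that loss, where the function name and the state scores are illustrative assumptions, not taken from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Relative entropy D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical "start-event-end" state scores for one event: a probability
# distribution over the three semantic states, one per modality.
text_states = [0.2, 0.6, 0.2]     # e.g. distilled from a language model (illustrative)
video_states = [0.25, 0.55, 0.2]  # e.g. from a vision-language model (illustrative)

# Cross-modal alignment drives this divergence toward zero.
alignment_loss = kl_divergence(text_states, video_states)
```

The divergence is zero only when the two state distributions agree, so minimizing it enforces the cross-modal consistency the method targets.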


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/c8dfa5e8b017/entropy-26-00730-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/5d5964172d1c/entropy-26-00730-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/a2a22db6697a/entropy-26-00730-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/a9ff4794d1cd/entropy-26-00730-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4ad/11431080/e8a3423adde4/entropy-26-00730-g005.jpg

Similar Articles

1. Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.
Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.
2. Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
3. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos.
IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2725-2741. doi: 10.1109/TPAMI.2020.3038993. Epub 2022 Apr 1.
4. SDN: Semantic Decoupling Network for Temporal Language Grounding.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6598-6612. doi: 10.1109/TNNLS.2022.3211850. Epub 2024 May 2.
5. Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval.
IEEE Trans Image Process. 2021;30:2989-3004. doi: 10.1109/TIP.2020.3048680. Epub 2021 Feb 18.
6. Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.
7. HiSA: Hierarchically Semantic Associating for Video Temporal Grounding.
IEEE Trans Image Process. 2022;31:5178-5188. doi: 10.1109/TIP.2022.3191841. Epub 2022 Aug 4.
8. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8836-8853. doi: 10.1109/TPAMI.2024.3411045. Epub 2024 Nov 6.
9. Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding.
IEEE Trans Image Process. 2021;30:3252-3262. doi: 10.1109/TIP.2021.3058614. Epub 2021 Mar 2.
10. Enhancing Video-Language Representations With Structural Spatio-Temporal Alignment.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):7701-7719. doi: 10.1109/TPAMI.2024.3393452. Epub 2024 Nov 6.
