
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization.

Authors

Muksimova Shakhnoza, Umirzakova Sabina, Sultanov Murodjon, Cho Young Im

Affiliations

Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea.

Department of Information Systems and Technologies, Tashkent State University of Economics, Tashkent 100066, Uzbekistan.

Publication Information

Sensors (Basel). 2025 Jan 24;25(3):707. doi: 10.3390/s25030707.

Abstract

Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based temporal localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets: YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.
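The abstract does not give implementation details for the Neural ODE-based temporal localization, so the following is only a minimal PyTorch sketch of what such a component might look like: a learned dynamics network evolves a clip-level feature through a fixed-step RK4 integrator, and a readout head decodes continuous, normalized event-boundary offsets. The module names, dimensions, step count, and integrator choice are all illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class ODEDynamics(nn.Module):
    """Learned dynamics f(t, h) governing the hidden state's evolution (hypothetical)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Concatenate scalar time onto the state so the dynamics can be time-dependent.
        t_col = t.expand(h.size(0), 1)
        return self.net(torch.cat([h, t_col], dim=-1))

class NeuralODEBoundaryHead(nn.Module):
    """Hypothetical boundary localizer: integrate h'(t) = f(t, h) with fixed-step RK4
    over t in [0, 1], then decode normalized (start, end) event offsets."""
    def __init__(self, dim: int, steps: int = 8):
        super().__init__()
        self.f = ODEDynamics(dim)
        self.steps = steps
        self.readout = nn.Linear(dim, 2)  # -> (start, end) offsets in [0, 1]

    def forward(self, h0: torch.Tensor) -> torch.Tensor:
        h, dt = h0, 1.0 / self.steps
        t = torch.zeros(1)
        for _ in range(self.steps):
            # Classic RK4 step: four slope evaluations, weighted average.
            k1 = self.f(t, h)
            k2 = self.f(t + dt / 2, h + dt / 2 * k1)
            k3 = self.f(t + dt / 2, h + dt / 2 * k2)
            k4 = self.f(t + dt, h + dt * k3)
            h = h + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t = t + dt
        return torch.sigmoid(self.readout(h))

# Usage: one 512-d clip feature per batch item -> normalized event boundaries.
boundaries = NeuralODEBoundaryHead(dim=512)(torch.randn(4, 512))
print(boundaries.shape)  # torch.Size([4, 2])
```

Treating boundary prediction as a continuous-time integration rather than a discrete classification over frames is what allows event boundaries to fall between sampled frames, which is the usual motivation for ODE-based localization.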
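Similarly, the cross-modal memory retrieval component is described only at a high level. One plausible reading, sketched below under stated assumptions, is standard scaled dot-product attention from video-frame queries over a fixed bank of precomputed text-knowledge embeddings, with the retrieved context fused back into the video features. The memory size, fusion strategy, and all names here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMemory(nn.Module):
    """Hypothetical cross-modal retrieval: video queries attend over a fixed bank
    of text-knowledge embeddings; the retrieved context is concatenated and fused."""
    def __init__(self, dim: int, memory: torch.Tensor):
        super().__init__()
        self.register_buffer("memory", memory)  # (num_entries, dim) text bank
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, dim)
        q = self.q(video_feats)                                  # (B, T, D)
        k, v = self.k(self.memory), self.v(self.memory)          # (N, D)
        attn = F.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)  # (B, T, N)
        retrieved = attn @ v                                     # (B, T, D)
        return self.fuse(torch.cat([video_feats, retrieved], dim=-1))

# Usage with a random 1000-entry text memory and 16 frames of 512-d features.
mem = torch.randn(1000, 512)
enriched = CrossModalMemory(512, mem)(torch.randn(2, 16, 512))
print(enriched.shape)  # torch.Size([2, 16, 512])
```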

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8fa0/11821176/8ad0795a232c/sensors-25-00707-g001.jpg
