Suppr超能文献

Cap4Video++: Enhancing Video Understanding With Auxiliary Captions.

作者信息

Wu Wenhao, Wang Xiaohan, Luo Haipeng, Wang Jingdong, Yang Yi, Ouyang Wanli

出版信息

IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5223-5237. doi: 10.1109/TPAMI.2024.3410329.

Abstract

Understanding videos, especially aligning them with textual data, presents a significant challenge in computer vision. The advent of vision-language models (VLMs) like CLIP has sparked interest in leveraging their capabilities for enhanced video understanding, showing marked advancements in both performance and efficiency. However, current methods often neglect vital user-generated metadata such as video titles. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. More recently, we witness the flourishing of large language models (LLMs) like ChatGPT. Cap4Video++ harnesses the synergy of vision-language models (VLMs) and large language models (LLMs) to generate video captions, utilized in three key phases: (i) Input stage employs Semantic Pair Sampling to extract beneficial samples from captions, aiding contrastive learning. (ii) Intermediate stage sees Video-Caption Cross-modal Interaction and Adaptive Caption Selection work together to bolster video and caption representations. (iii) Output stage introduces a Complementary Caption-Text Matching branch, enhancing the primary video branch by improving similarity calculations. Our comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks clearly demonstrate Cap4Video++'s superiority over existing models, highlighting its effectiveness in utilizing automatically generated captions to advance video understanding.

摘要

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验