Visual-Language Scene-Relation-Aware Zero-Shot Captioner.

Author information

Bao Qianyue, Liu Fang, Jiao Licheng, Liu Yang, Li Shuo, Li Lingling, Liu Xu, Chen Puhua

Publication information

IEEE Trans Pattern Anal Mach Intell. 2025 Oct;47(10):8725-8739. doi: 10.1109/TPAMI.2025.3581174.

Abstract

Zero-shot image captioning harnesses the knowledge of pre-trained visual language models (VLMs) and language models (LMs) to generate captions for target-domain images without training on paired samples. Existing methods attempt to establish high-quality connections between the visual and textual modalities through text-only pre-training tasks. These methods fall into two categories: sentence-level and entity-level. Although they perform well on some metrics, they suffer from hallucinations caused by biased associations learned during training. In this paper, we propose a scene-relation-level pre-training task that treats relations as more valuable bridges between modalities. Based on this, we construct a novel Visual-Language Scene-Relation-Aware Captioner (SRACap), which gains the ability to predict scene relations while generating captions for images. In addition, SRACap possesses excellent cross-domain zero-shot generalization capability, which is driven by a well-designed scene reinforcement switching pipeline. We introduce a scene policy network that dynamically crops salient regions from images and feeds them into a language model to generate captions. We integrate multiple expert CLIP models into a mixture-of-rewards (MoR) module that serves as the reward source, and optimize SRACap with the policy gradient algorithm during the zero-shot inference stage. As scene reinforcement switching iterates, SRACap gradually refines the details of the generated caption while maintaining high semantic consistency across the visual and linguistic modalities. We conduct extensive experiments on multiple standard image captioning benchmarks, showing that SRACap can accurately understand scene structures and generate high-quality text, significantly outperforming other zero-shot inference methods.
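The abstract gives no implementation details, so the following is only a toy sketch of the general idea it names: a policy that chooses among candidate image crops and is updated by a policy gradient (plain REINFORCE here) using a reward averaged over several experts. Everything in it is an assumption made for illustration: the function names, the fixed table of made-up expert scores standing in for CLIP models, the equal-weight averaging, and the running-average baseline. It is not the SRACap method itself.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_rewards(expert_scores, weights=None):
    """Combine per-expert scores into one scalar reward.

    Assumption: a weighted mean (equal weights by default). The paper's
    MoR module may combine its expert CLIP scores differently.
    """
    if weights is None:
        weights = [1.0 / len(expert_scores)] * len(expert_scores)
    return sum(w * s for w, s in zip(weights, expert_scores))


def reinforce(expert_table, steps=2000, lr=0.5, seed=0):
    """Toy REINFORCE loop over a discrete set of candidate crops.

    expert_table[k] holds hypothetical expert scores for crop k; in a
    real system these would come from scoring a generated caption.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(expert_table)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one crop from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = mixture_of_rewards(expert_table[action])
        advantage = reward - baseline
        baseline += 0.05 * advantage  # running-average baseline
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)


# Made-up scores: 3 candidate crops, 2 hypothetical experts each.
EXPERT_TABLE = [[0.2, 0.3], [0.8, 0.7], [0.4, 0.5]]
final_probs = reinforce(EXPERT_TABLE)
best_crop = max(range(len(final_probs)), key=final_probs.__getitem__)
```

With these made-up scores, the policy tends to concentrate its probability on the crop whose combined reward is highest. A real pipeline would replace the fixed table with scores computed on captions generated from each crop.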
