CoVR-2: Automatic Data Construction for Composed Video Retrieval.

Authors

Ventura Lucas, Yang Antoine, Schmid Cordelia, Varol Gul

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):11409-11421. doi: 10.1109/TPAMI.2024.3463799. Epub 2024 Nov 6.

DOI:10.1109/TPAMI.2024.3463799
PMID:39302778
Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include Composed Video Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.
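
The mining step described in the abstract (pairing videos whose captions are similar) can be illustrated with a minimal sketch. It assumes, purely for illustration, that "similar" means two captions of equal length that differ in exactly one word; the abstract does not specify the criterion. The function name `mine_similar_caption_pairs` and the sample captions are hypothetical, and the later step of prompting a large language model for the modification text is not shown.

```python
from collections import defaultdict
from itertools import combinations


def mine_similar_caption_pairs(captions):
    """Return index pairs (i, j), i < j, whose captions differ in exactly one word.

    Assumption (not from the abstract): similarity = same word count,
    exactly one differing position, compared case-insensitively.
    """
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = tuple(caption.lower().split())
        # Mask each position in turn: captions that share a masked key
        # agree on every other word and on total length.
        for pos in range(len(words)):
            key = words[:pos] + ("<MASK>",) + words[pos + 1:]
            buckets[key].append((idx, words[pos]))

    pairs = set()
    for members in buckets.values():
        for (i, word_i), (j, word_j) in combinations(members, 2):
            if word_i != word_j:  # skip exact duplicate captions
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [
        "a dog running on the beach",
        "a cat running on the beach",
        "a man cooking dinner",
        "a dog running on the grass",
    ]
    for i, j in mine_similar_caption_pairs(sample):
        print(sample[i], "<->", sample[j])
```

Bucketing by masked captions avoids comparing every caption against every other, which matters when the caption collection runs into the millions. Each mined pair would then supply the query video, the target video, and the two captions from which a modification text could be generated.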

Similar articles

1
CoVR-2: Automatic Data Construction for Composed Video Retrieval.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):11409-11421. doi: 10.1109/TPAMI.2024.3463799. Epub 2024 Nov 6.
2
Learning to Answer Visual Questions From Web Videos.
IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3202-3218. doi: 10.1109/TPAMI.2022.3173208. Epub 2025 Apr 8.
3
An Ensemble of Generation- and Retrieval-based Image Captioning with Dual Generator Generative Adversarial Network.
IEEE Trans Image Process. 2020 Oct 15;PP. doi: 10.1109/TIP.2020.3028651.
4
Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval.
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3665-3678. doi: 10.1109/TPAMI.2023.3346434. Epub 2024 Apr 3.
5
Topic-Oriented Image Captioning Based on Order-Embedding.
IEEE Trans Image Process. 2019 Jun;28(6):2743-2754. doi: 10.1109/TIP.2018.2889922. Epub 2018 Dec 27.
6
Backward induction-based deep image search.
PLoS One. 2024 Sep 9;19(9):e0310098. doi: 10.1371/journal.pone.0310098. eCollection 2024.
7
Evaluation of automatic video captioning using direct assessment.
PLoS One. 2018 Sep 4;13(9):e0202789. doi: 10.1371/journal.pone.0202789. eCollection 2018.
8
Cap4Video++: Enhancing Video Understanding With Auxiliary Captions.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5223-5237. doi: 10.1109/TPAMI.2024.3410329.
9
Tasks Integrated Networks: Joint Detection and Retrieval for Image Search.
IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):456-473. doi: 10.1109/TPAMI.2020.3009758. Epub 2021 Dec 7.
10
Learning to Overcome Noise in Weak Caption Supervision for Object Detection.
IEEE Trans Pattern Anal Mach Intell. 2023 Apr;45(4):4897-4914. doi: 10.1109/TPAMI.2022.3187350. Epub 2023 Mar 7.