
Fourier-modulated CLIP for zero-shot vehicle counting.

Author Information

Luo Yunpeng, Zhang Qin, Luo Jiawei, Huang Shixin, Wan Xiaoyu, Nie Xixi

Affiliations

School of Vehicle Engineering, Chongqing Industry & Trade Polytechnic, Chongqing, 408000, China.

Department of Scientific Research, People's Hospital of Yubei District of Chongqing city, Chongqing, 401120, China.

Publication Information

Sci Rep. 2025 Jun 5;15(1):19732. doi: 10.1038/s41598-025-04876-7.

Abstract

Vehicle counting is a critical task in traffic monitoring and management. Although large vision-language models have made significant progress in zero-shot text-image matching, adapting them to vehicle counting remains challenging. To tackle this issue, we propose FCLIP-VC, a Fourier-modulated Contrastive Language-Image Pre-training (CLIP) model for zero-shot vehicle counting. First, a Vision Transformer (ViT)-based CLIP model is employed to encode image features for precise vehicle localization. We also introduce a learnable visual prompt that avoids tuning global parameters, allowing pre-trained CLIP knowledge to transfer effectively to dense prediction tasks. Next, for patch-level image embeddings, we introduce a patch-language contrast loss that sharpens the model's capture of fine-grained image features. We then propose a patch-language discrete Fourier transform (DFT) interaction module that extracts multi-scale image features in the frequency domain, improving the model's handling of diverse vehicle sizes and complex traffic environments. Finally, a content-aware density map decoder generates accurate density map predictions through multi-layer convolution and progressive upsampling. Extensive experiments demonstrate that FCLIP-VC achieves state-of-the-art accuracy in zero-shot vehicle counting.
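The abstract does not detail how the DFT interaction module splits features by scale. Below is a minimal numpy sketch of the general frequency-band idea it describes: transform a feature map to the frequency domain, partition the spectrum into radial bands (low bands encode coarse layout such as large or nearby vehicles, high bands encode fine detail such as small or distant ones), and transform each band back. The function name, band scheme, and band count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dft_multiscale_features(feat, num_bands=3):
    """Illustrative multi-scale decomposition of a 2-D feature map via the DFT.

    Partitions the centred spectrum into `num_bands` radial frequency bands
    and inverse-transforms each band back to the spatial domain.
    """
    H, W = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))   # centre low frequencies
    cy, cx = H // 2, W // 2
    yy, xx = np.mgrid[0:H, 0:W]
    radius = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    r_max = radius.max()
    bands = []
    for b in range(num_bands):
        lo = r_max * b / num_bands
        # make the last band inclusive so every frequency is covered
        hi = r_max + 1.0 if b == num_bands - 1 else r_max * (b + 1) / num_bands
        mask = (radius >= lo) & (radius < hi)
        # inverse DFT of the masked spectrum -> one spatial scale
        band = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
        bands.append(band)
    return bands

feat = np.random.default_rng(0).standard_normal((32, 32))
bands = dft_multiscale_features(feat)
```

Because the radial masks partition the spectrum with no overlap, the bands sum back to the original feature map, so no information is lost by the split.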


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cca9/12141738/bdb9c995e99f/41598_2025_4876_Fig1_HTML.jpg
