
Fourier-modulated CLIP for zero-shot vehicle counting.

Author Information

Luo Yunpeng, Zhang Qin, Luo Jiawei, Huang Shixin, Wan Xiaoyu, Nie Xixi

Affiliations

School of Vehicle Engineering, Chongqing Industry & Trade Polytechnic, Chongqing, 408000, China.

Department of Scientific Research, People's Hospital of Yubei District of Chongqing city, Chongqing, 401120, China.

Publication Information

Sci Rep. 2025 Jun 5;15(1):19732. doi: 10.1038/s41598-025-04876-7.

Abstract

Vehicle counting is a critical task in traffic monitoring and management. Although large vision-language models have made significant progress in zero-shot text-image matching, adapting them to vehicle counting remains challenging. To tackle this issue, we propose FCLIP-VC, a Fourier-modulated Contrastive Language-Image Pre-training (CLIP) model for zero-shot vehicle counting. First, a Vision Transformer (ViT)-based CLIP model is employed to encode image features for precise vehicle localization. We also introduce a learnable visual prompt that avoids tuning global parameters, allowing pre-trained CLIP knowledge to transfer effectively to dense prediction tasks. Next, for patch-level image embeddings, we introduce a patch-language contrast loss that sharpens the model's capture of fine-grained image features. We then propose a patch-language discrete Fourier transform (DFT) interaction module that extracts multi-scale image features in the frequency domain, improving the model's handling of diverse vehicle sizes and complex traffic environments. Finally, a content-aware density map decoder generates accurate density map predictions through multi-layer convolution and progressive upsampling. Extensive experiments demonstrate that FCLIP-VC achieves state-of-the-art accuracy in zero-shot vehicle counting.
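The abstract does not detail how the DFT interaction module splits features by scale. Below is a minimal numpy sketch of the general frequency-band idea it describes: transform a feature map to the frequency domain, partition the spectrum into radial bands (low bands encode coarse layout such as large or nearby vehicles, high bands encode fine detail such as small or distant ones), and transform each band back. The function name, band scheme, and band count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dft_multiscale_features(feat, num_bands=3):
    """Illustrative multi-scale decomposition of a 2-D feature map via the DFT.

    Partitions the centred spectrum into `num_bands` radial frequency bands
    and inverse-transforms each band back to the spatial domain.
    """
    H, W = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))   # centre low frequencies
    cy, cx = H // 2, W // 2
    yy, xx = np.mgrid[0:H, 0:W]
    radius = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    r_max = radius.max()
    bands = []
    for b in range(num_bands):
        lo = r_max * b / num_bands
        # make the last band inclusive so every frequency is covered
        hi = r_max + 1.0 if b == num_bands - 1 else r_max * (b + 1) / num_bands
        mask = (radius >= lo) & (radius < hi)
        # inverse DFT of the masked spectrum -> one spatial scale
        band = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
        bands.append(band)
    return bands

feat = np.random.default_rng(0).standard_normal((32, 32))
bands = dft_multiscale_features(feat)
```

Because the radial masks partition the spectrum with no overlap, the bands sum back to the original feature map, so no information is lost by the split.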


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cca9/12141738/bdb9c995e99f/41598_2025_4876_Fig1_HTML.jpg
