Luo Yunpeng, Zhang Qin, Luo Jiawei, Huang Shixin, Wan Xiaoyu, Nie Xixi
School of Vehicle Engineering, Chongqing Industry & Trade Polytechnic, Chongqing, 408000, China.
Department of Scientific Research, People's Hospital of Yubei District of Chongqing city, Chongqing, 401120, China.
Sci Rep. 2025 Jun 5;15(1):19732. doi: 10.1038/s41598-025-04876-7.
Vehicle counting is a critical task in traffic monitoring and management. Although large vision-language models have made significant progress in zero-shot text-image matching, adapting them to vehicle counting remains challenging. To address this, we propose FCLIP-VC, a Fourier-modulated Contrastive Language-Image Pre-training (CLIP) model for zero-shot vehicle counting. First, a Vision Transformer (ViT)-based CLIP model encodes image features for precise vehicle localization. We also introduce a learnable visual prompt that avoids tuning global parameters, so that pre-trained CLIP knowledge can be applied effectively to dense prediction tasks. Next, for patch-level image embeddings, we introduce a patch-language contrast loss that improves the model's ability to capture fine-grained image features. We then propose a patch-language discrete Fourier transform (DFT) interaction module, which uses the DFT to extract multi-scale image features in the frequency domain and thereby improves the model's handling of diverse vehicle sizes and complex traffic environments. Finally, a content-aware density map decoder generates accurate density map predictions through multi-layer convolution and progressive upsampling. Extensive experiments demonstrate that FCLIP-VC achieves state-of-the-art accuracy in zero-shot vehicle counting.
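To make the frequency-domain idea concrete, the following is a minimal sketch of how a DFT can separate coarse (low-frequency) from fine (high-frequency) structure in a row of patch activations. It is not the FCLIP-VC interaction module, whose details the abstract does not give. The function names (`dft`, `idft`, `split_bands`), the `cutoff` parameter, and the sample `row` are all hypothetical and chosen only for illustration. It uses a naive one-dimensional DFT from the standard library; a real implementation would more likely apply a two-dimensional FFT such as `numpy.fft.fft2` to the patch feature grid.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(), including the 1/n normalisation."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split a real signal into low- and high-frequency parts.

    Bin k and bin n - k represent the same frequency for a real signal,
    so both are kept together; this keeps each band real-valued.
    `cutoff` is a hypothetical knob, not a parameter from the paper.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    low = [v.real for v in idft(low_spec)]
    high = [v.real for v in idft(high_spec)]
    return low, high


# Hypothetical patch activations along one image row: a wide bump
# (a large vehicle) and a single-patch spike (a small vehicle).
row = [0.0, 0.2, 0.8, 1.0, 0.8, 0.2, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0]

low, high = split_bands(row, cutoff=2)
```

The two bands sum back to the original row, so no information is lost by the split. The low band carries the smooth, large-scale response and the high band carries the residual sharp detail, which is one plausible reading of "multi-scale features in the frequency domain".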