Yao Jianqun, Li Jinming, Li Yuxuan, Zhang Mingzhu, Zuo Chen, Dong Shi, Dai Zhe
CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China.
School of Transportation Engineering, Chang'an University, Xi'an 710064, China.
Sensors (Basel). 2024 Sep 6;24(17):5800. doi: 10.3390/s24175800.
As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behavior. In recent years, drones have emerged as an important tool for monitoring the condition of traffic signs. However, existing image processing techniques rely heavily on image annotations, and building a high-quality dataset with diverse training images and human annotations is time-consuming. In this paper, we introduce vision-language models (VLMs) to the traffic sign detection task. Without the need for discrete image labels, rapid deployment is achieved through multi-modal learning and large-scale pretrained networks. First, we compile a keyword dictionary to describe traffic signs: the Chinese national standard supplies the shape and color information, and our program applies Bootstrapping Language-Image Pretraining v2 (BLIPv2) to translate representative images into text descriptions. Second, a Contrastive Language-Image Pretraining (CLIP) framework characterizes both the drone images and the text descriptions, using pretrained encoder networks to create visual features and word embeddings. Third, the category of each traffic sign is predicted from the similarity between drone images and keywords, with cosine distance and a softmax function producing the class probability distribution. To evaluate performance, we apply the proposed method in a practical application: drone images captured in Guyuan, China, are used to record the condition of traffic signs. Further experiments cover two widely used public datasets. The results indicate that our vision-language-model-based method achieves acceptable prediction accuracy at a low training cost.
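The third step of the pipeline, scoring the cosine similarity between an image embedding and each keyword's text embedding and converting the scores into a class probability distribution with a softmax, can be sketched as below. This is a minimal illustration, not the paper's implementation: the 4-dimensional embeddings, the class descriptions, and the temperature value are hypothetical placeholders (real CLIP features are hundreds of dimensions).

```python
import numpy as np

def classify_by_similarity(image_emb: np.ndarray,
                           text_embs: np.ndarray,
                           temperature: float = 100.0) -> np.ndarray:
    """Return a class probability distribution from CLIP-style embeddings.

    image_emb : (d,) visual feature for one cropped traffic sign.
    text_embs : (k, d) word embeddings, one per keyword/class description.
    Both are L2-normalized so the dot product equals cosine similarity.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)      # scaled cosine similarities, shape (k,)
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Toy example with hypothetical embeddings for three sign classes.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "red circular prohibition sign"
                      [0.0, 1.0, 0.0, 0.0],   # e.g. "blue rectangular guide sign"
                      [0.0, 0.0, 1.0, 0.0]])  # e.g. "yellow triangular warning sign"
probs = classify_by_similarity(image_emb, text_embs)
print(probs.argmax())  # index of the class most similar to the image
```

The temperature factor mirrors the learned logit scale used in CLIP-style models; it sharpens the softmax so the highest-similarity class dominates the distribution.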