Yao Jianqun, Li Jinming, Li Yuxuan, Zhang Mingzhu, Zuo Chen, Dong Shi, Dai Zhe
CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China.
School of Transportation Engineering, Chang'an University, Xi'an 710064, China.
Sensors (Basel). 2024 Sep 6;24(17):5800. doi: 10.3390/s24175800.
As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behavior. In recent years, drones have emerged as an important tool for monitoring the condition of traffic signs. However, existing image processing techniques rely heavily on image annotations, and building a high-quality dataset with diverse training images and human annotations is time-consuming. In this paper, we introduce vision-language models (VLMs) to the traffic sign detection task. Without the need for discrete image labels, rapid deployment is achieved through multi-modal learning and large-scale pretrained networks. First, we compile a keyword dictionary to describe traffic signs: the Chinese national standard supplies the shape and color information, and our program applies Bootstrapping Language-Image Pretraining v2 (BLIPv2) to translate representative images into text descriptions. Second, a Contrastive Language-Image Pretraining (CLIP) framework characterizes both the drone images and the text descriptions, using pretrained encoder networks to create visual features and word embeddings. Third, the category of each traffic sign is predicted from the similarity between drone images and keywords, with cosine distance and a softmax function producing the class probability distribution. To evaluate performance, we apply the proposed method in a practical application: drone images captured in Guyuan, China, are used to record the condition of traffic signs. Further experiments cover two widely used public datasets. The results indicate that our vision-language-model-based method achieves acceptable prediction accuracy at a low training cost.
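The third step of the pipeline, scoring the cosine similarity between an image embedding and each keyword's text embedding and converting the scores into a class probability distribution with a softmax, can be sketched as below. This is a minimal illustration, not the paper's implementation: the 4-dimensional embeddings, the class descriptions, and the temperature value are hypothetical placeholders (real CLIP features are hundreds of dimensions).

```python
import numpy as np

def classify_by_similarity(image_emb: np.ndarray,
                           text_embs: np.ndarray,
                           temperature: float = 100.0) -> np.ndarray:
    """Return a class probability distribution from CLIP-style embeddings.

    image_emb : (d,) visual feature for one cropped traffic sign.
    text_embs : (k, d) word embeddings, one per keyword/class description.
    Both are L2-normalized so the dot product equals cosine similarity.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)      # scaled cosine similarities, shape (k,)
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Toy example with hypothetical embeddings for three sign classes.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "red circular prohibition sign"
                      [0.0, 1.0, 0.0, 0.0],   # e.g. "blue rectangular guide sign"
                      [0.0, 0.0, 1.0, 0.0]])  # e.g. "yellow triangular warning sign"
probs = classify_by_similarity(image_emb, text_embs)
print(probs.argmax())  # index of the class most similar to the image
```

The temperature factor mirrors the learned logit scale used in CLIP-style models; it sharpens the softmax so the highest-similarity class dominates the distribution.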