Zhou Yu, Wei Yan
College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China.
Sensors (Basel). 2025 Jul 24;25(15):4582. doi: 10.3390/s25154582.
To mitigate the technical challenges of small-object detection, feature degradation, and spatial-contextual misalignment in UAV-acquired imagery, this paper proposes UAV-DETR, an enhanced Transformer-based object detection model designed for aerial scenarios. UAV imagery often suffers from feature degradation caused by low resolution and complex backgrounds, and from semantic-spatial misalignment caused by dynamic shooting conditions. This work addresses these challenges by enhancing feature perception, semantic representation, and spatial alignment. Architecturally extending the RT-DETR framework, UAV-DETR incorporates three novel modules: the Channel-Aware Sensing (CAS) module, the Scale-Optimized Enhancement Pyramid (SOEP) module, and the newly designed Context-Spatial Alignment Module (CSAM), which integrates contextual and spatial calibration. Together, these components strengthen multi-scale feature extraction, semantic representation, and spatial-contextual alignment. The CAS module refines the backbone to improve multi-scale feature perception, while SOEP enriches the semantics of shallow layers through lightweight channel-weighted fusion. CSAM further optimizes the hybrid encoder by simultaneously correcting contextual inconsistencies and spatial misalignments during feature fusion, enabling more precise cross-scale integration. Comprehensive comparisons with mainstream detectors, including Faster R-CNN and YOLOv5, show that UAV-DETR achieves superior small-object detection in complex aerial scenes, evaluated in terms of mAP@0.5, parameter count, and computational complexity (GFLOPs). On the VisDrone2019 benchmark, UAV-DETR reaches an mAP@0.5 of 51.6%, surpassing RT-DETR by 3.5 percentage points while reducing the parameter count from 19.8 million to 16.8 million.
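The abstract attributes SOEP's gains to lightweight channel-weighted fusion of shallow-layer features, but does not give the formulation. The sketch below is a generic, hypothetical channel-gating fusion (global average pooling followed by sigmoid gates, applied to a shallow map before adding a deeper one), written in pure Python for illustration only; the function names and the gating scheme are assumptions, not the authors' implementation.

```python
import math

def global_avg_pool(fmap):
    """Per-channel global average pooling; fmap is a C x H x W nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_weighted_fusion(shallow, deep):
    """Reweight each shallow channel by a sigmoid gate derived from its
    global average, then fuse element-wise with the deep feature map.
    A toy stand-in for SOEP-style channel-weighted fusion."""
    gates = [sigmoid(g) for g in global_avg_pool(shallow)]
    return [
        [[gates[c] * shallow[c][i][j] + deep[c][i][j]
          for j in range(len(shallow[c][i]))]
         for i in range(len(shallow[c]))]
        for c in range(len(shallow))
    ]

# Example: one 2x2 channel; the shallow map is gated before fusion.
shallow = [[[1.0, 1.0], [1.0, 1.0]]]
deep = [[[0.0, 0.0], [0.0, 0.0]]]
fused = channel_weighted_fusion(shallow, deep)
```

Such gating is "lightweight" in the sense the abstract suggests: it adds only one scalar weight per channel on top of the element-wise fusion.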
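Performance is reported as mAP@0.5, i.e. mean average precision where a detection counts as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal IoU sketch for axis-aligned boxes follows; this is generic metric code, not taken from the paper.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# Two unit-offset 2x2 boxes overlap in a 1x1 region: IoU = 1 / 7,
# below the 0.5 threshold used by mAP@0.5.
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```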