Wang Han, Jin Lei, Wang Guangcheng, Liu Wenjie, Shi Quan, Hou Yingyan, Liu Jiali
School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China.
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China.
Sensors (Basel). 2025 Jun 20;25(13):3854. doi: 10.3390/s25133854.
Pedestrian detection is an important research topic in visual cognition and autonomous driving systems. The introduction of the YOLO model significantly improved both detection speed and accuracy. To achieve all-day (daytime and nighttime) detection performance, multimodal YOLO models based on RGB-FIR image pairs have become a research hotspot. Existing work has focused on designing fusion modules that operate after the RGB and FIR branch backbones have extracted their features, yielding multimodal backbone frameworks based on back-end fusion. However, these methods overlook the complementarity and prior knowledge shared between modalities and scales during front-end raw feature extraction in the RGB and FIR branch backbones. As a result, the performance of a back-end fusion framework depends largely on how well each modality's raw features are represented at the front end. This paper proposes a novel RGB-FIR multimodal backbone framework based on a cross-modality context attention model (CCAM). Unlike existing work, it adopts a multi-level fusion design. At the front end of the parallel RGB-FIR backbone, a CCAM module is constructed for the raw features at each scale: the RGB-FIR fusion result of the lower-level features is fully exploited to optimize the spatial weights of the upper-level RGB and FIR features, achieving cross-modality and cross-scale complementarity between adjacent feature extraction stages. At the back end of the parallel RGB-FIR network, a channel-spatial joint attention model (CBAM) is combined with self-attention models to produce the final RGB-FIR fusion features at each scale from the CCAM-optimized RGB and FIR features. Comparative experiments against current RGB-FIR multimodal YOLO models, conducted on multiple public RGB-FIR datasets and across several performance metrics, show that the proposed method significantly improves the accuracy and robustness of pedestrian detection.
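The front-end CCAM idea described above can be illustrated with a short sketch: the lower-level RGB-FIR fusion result is collapsed into a spatial weight map, which is downsampled and used to reweight the next-scale RGB and FIR raw features. The following minimal PyTorch sketch assumes this structure; the module name (CCAMSketch), layer choices, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the front-end CCAM described in the abstract:
# a lower-level RGB-FIR fusion result produces a spatial weight map that
# modulates the upper-level RGB and FIR features. Names and layer choices
# are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CCAMSketch(nn.Module):
    def __init__(self, low_channels: int):
        super().__init__()
        # Collapse the fused lower-level features into a single-channel
        # spatial weight map with values in (0, 1).
        self.spatial = nn.Sequential(
            nn.Conv2d(low_channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Downsample the weight map to the upper level's spatial size.
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, fused_low, rgb_high, fir_high):
        # fused_low: lower-level RGB-FIR fusion result, (B, C_low, 2H, 2W)
        # rgb_high / fir_high: upper-level raw features, (B, C_high, H, W)
        w = self.pool(self.spatial(fused_low))  # (B, 1, H, W)
        # Broadcast the shared spatial weights over both modality branches.
        return rgb_high * w, fir_high * w

# Usage with dummy tensors at two adjacent backbone scales.
ccam = CCAMSketch(low_channels=64)
fused_low = torch.randn(1, 64, 80, 80)
rgb_high = torch.randn(1, 128, 40, 40)
fir_high = torch.randn(1, 128, 40, 40)
rgb_out, fir_out = ccam(fused_low, rgb_high, fir_high)
```

In this reading, the lower scale acts as cross-modality prior knowledge for the upper scale, which matches the abstract's claim of complementarity between adjacent feature extraction stages.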
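The back-end step combines CBAM-style channel and spatial attention with self-attention. A minimal sketch of one plausible combination follows, assuming one CBAM per branch and multi-head self-attention over the concatenated RGB and FIR tokens; all module names and design details are hypothetical, not the paper's architecture.

```python
# Hypothetical sketch of the back-end fusion: CBAM-style channel and spatial
# attention refines each CCAM-optimized branch, then multi-head self-attention
# over the concatenated tokens yields the fused feature at one scale.
import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention from pooled descriptors (avg + max).
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-pooled maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))

class BackEndFusionSketch(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.cbam_rgb = CBAMSketch(channels)
        self.cbam_fir = CBAMSketch(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb, fir):
        b, c, h, w = rgb.shape
        rgb, fir = self.cbam_rgb(rgb), self.cbam_fir(fir)
        # Self-attention over concatenated RGB and FIR tokens, (B, 2HW, C).
        tokens = torch.cat([rgb, fir], dim=2).flatten(2).transpose(1, 2)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = fused.transpose(1, 2).view(b, c, 2 * h, w)
        rgb_t, fir_t = fused.split(h, dim=2)
        # Project the two refined branches back to one fused feature map.
        return self.proj(torch.cat([rgb_t, fir_t], dim=1))  # (B, C, H, W)

fusion = BackEndFusionSketch(channels=128)
out = fusion(torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40))
```

The design choice sketched here, per-branch CBAM followed by joint self-attention, mirrors the abstract's ordering: local channel-spatial reweighting first, then global cross-modality interaction to produce the final fusion features fed to the detection head.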