Zhang Xude, Ou Weihua, Wu Xiaoping, Zhang Changzhen
Engineering Research Center of Micro-Nano and Intelligent Manufacturing, Ministry of Education, Kaili University, Kaili, Guizhou, China.
College of Microelectronics and Artificial Intelligence, Kaili University, Kaili, Guizhou, China.
PLoS One. 2025 Jun 30;20(6):e0325962. doi: 10.1371/journal.pone.0325962. eCollection 2025.
With the rapid development of intelligent transportation systems, particularly in traffic image detection tasks, the introduction of the transformer architecture has greatly improved model performance. However, traditional transformer models incur high computational costs during training and deployment because of the quadratic complexity of their self-attention mechanism, which limits their application in resource-constrained environments. To overcome this limitation, this paper proposes a novel hybrid architecture, Mamba Hybrid Self-Attention Vision Transformers (MHS-VIT), which combines the strengths of the Mamba state-space model (SSM) and the transformer to improve both the modeling efficiency and the accuracy of the model when processing traffic images. Mamba, an SSM with linear time complexity, effectively reduces the computational burden without sacrificing performance, while the transformer's self-attention mechanism excels at capturing long-range spatial dependencies in images, which is crucial for understanding complex traffic scenes. Experimental results showed that MHS-VIT performed excellently on traffic image detection tasks: whether for vehicle detection, pedestrian detection, or traffic sign recognition, the model identified target objects accurately and quickly. Compared with backbone networks of the same scale, MHS-VIT achieved significant improvements in both accuracy and model parameter count.
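The abstract's central efficiency argument is the contrast between the quadratic cost of self-attention and the linear cost of a state-space scan. The following minimal sketch (not from the paper; shapes, the single-head attention form, and the scalar SSM parameters `a` and `b` are illustrative assumptions) shows where each cost arises: attention materializes an n × n score matrix, while an SSM-style recurrence makes one pass over the sequence with constant extra state.

```python
import numpy as np

def self_attention(x):
    """Naive single-head self-attention: the (n, n) score matrix
    makes time and memory scale as O(n^2) in sequence length n."""
    scores = x @ x.T / np.sqrt(x.shape[1])        # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x

def ssm_scan(x, a=0.9, b=0.5):
    """Toy linear-time state-space recurrence h_t = a*h_{t-1} + b*x_t:
    one pass over the sequence, O(n) time, O(1) extra state.
    (Mamba uses input-dependent, selective parameters; fixed scalars
    here are a simplifying assumption.)"""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

x = np.random.default_rng(0).standard_normal((8, 4))
print(self_attention(x).shape)  # (8, 4)
print(ssm_scan(x).shape)        # (8, 4)
```

Both operators map a length-n token sequence to another length-n sequence, which is what makes a hybrid stack of the two kinds of blocks straightforward to compose.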