Ahmed Muhammad Waqas, Sadiq Touseef, Rahman Hameedur, Alateyah Sulaiman Abdullah, Alnusayri Mohammed, Alatiyyah Mohammed, AlHammadi Dina Abdulaziz
Department of Computer Science, Air University, Islamabad, Pakistan.
Centre for Artificial Intelligence Research, Department of Information and Communication Technology, University of Agder, Grimstad, Norway.
PeerJ Comput Sci. 2025 May 23;11:e2796. doi: 10.7717/peerj-cs.2796. eCollection 2025.
This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses the fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Grey Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
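To illustrate the patch-embedding idea from the abstract, the sketch below combines a patch's raw pixels with one level of wavelet coefficients into a single feature vector. It is a minimal, hedged illustration only: it uses a hand-rolled 2D Haar transform in pure NumPy, a fixed square patch stands in for an MSER-detected region, and the function names (`haar2d`, `embed_patch`) are our own, not from the paper.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar wavelet decomposition of a 2D array with even dims.

    Returns (ll, lh, hl, hh): approximation plus horizontal, vertical,
    and diagonal detail coefficients, each half the size of the input.
    """
    # Average / difference along columns (pairs of adjacent columns)...
    a = (x[:, ::2] + x[:, 1::2]) / 2.0
    d = (x[:, ::2] - x[:, 1::2]) / 2.0
    # ...then along rows, yielding the four sub-bands.
    ll = (a[::2] + a[1::2]) / 2.0
    lh = (a[::2] - a[1::2]) / 2.0
    hl = (d[::2] + d[1::2]) / 2.0
    hh = (d[::2] - d[1::2]) / 2.0
    return ll, lh, hl, hh

def embed_patch(patch):
    """Concatenate raw pixels with Haar sub-band coefficients.

    For a 16x16 patch this gives 256 + 4*64 = 512 features; in the
    actual method the patch would come from an MSER region and feed
    a Vision Transformer's embedding layer.
    """
    ll, lh, hl, hh = haar2d(patch)
    return np.concatenate(
        [patch.ravel(), ll.ravel(), lh.ravel(), hl.ravel(), hh.ravel()]
    )

# Example: embed a dummy 16x16 patch (an MSER crop in the real pipeline).
patch = np.arange(256, dtype=float).reshape(16, 16)
emb = embed_patch(patch)
print(emb.shape)  # (512,)
```

A constant patch is a quick sanity check: its approximation band equals the constant value and all detail bands vanish, confirming that the detail coefficients respond only to local intensity changes such as object boundaries.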