Suppr 超能文献



MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer.

Authors

Ahmed Muhammad Waqas, Sadiq Touseef, Rahman Hameedur, Alateyah Sulaiman Abdullah, Alnusayri Mohammed, Alatiyyah Mohammed, AlHammadi Dina Abdulaziz

Affiliations

Department of Computer Science, Air University, Islamabad, Pakistan.

Centre for Artificial Intelligence Research, Department of Information and Communication Technology, University of Agder, Grimstad, Norway.

Publication

PeerJ Comput Sci. 2025 May 23;11:e2796. doi: 10.7717/peerj-cs.2796. eCollection 2025.

DOI: 10.7717/peerj-cs.2796
PMID: 40567749
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12190338/
Abstract

This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
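The abstract describes patch embeddings that concatenate raw pixels with multi-scale wavelet coefficients. A minimal sketch of that idea in plain NumPy, using a single-level 2D Haar transform; the function names and the 4×4 toy patch are illustrative, and the paper's actual pipeline additionally uses MSER-guided patch selection, which is omitted here:

```python
import numpy as np

def haar_dwt2(patch):
    """Single-level 2D Haar wavelet transform of a square patch with
    even side length. Returns the four sub-bands LL, LH, HL, HH."""
    a = patch[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = patch[0::2, 1::2]  # top-right
    c = patch[1::2, 0::2]  # bottom-left
    d = patch[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # approximation (low-low)
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def wavelet_augmented_embedding(patch):
    """Concatenate raw pixels with wavelet sub-band coefficients to form
    one patch-embedding vector, the idea behind the paper's
    wavelet-augmented patches (names here are illustrative)."""
    bands = haar_dwt2(patch.astype(np.float64))
    return np.concatenate([patch.ravel()] + [b.ravel() for b in bands])

# A 4x4 toy "patch": 16 raw pixels + 4 sub-bands of 4 coefficients each.
patch = np.arange(16, dtype=np.float64).reshape(4, 4)
emb = wavelet_augmented_embedding(patch)
print(emb.shape)  # -> (32,)
```

In the full method this vector would feed the Vision Transformer's patch-embedding layer in place of the usual raw-pixel projection.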

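The abstract also credits a Gray Wolf optimization step for sharpening feature discrimination. A toy sketch of the standard Grey Wolf Optimizer update rule (the alpha/beta/delta scheme of Mirjalili et al., 2014) minimizing a sphere function; the hyperparameters and objective are illustrative, not those used in MAPE-ViT:

```python
import numpy as np

def grey_wolf_optimize(fitness, dim, n_wolves=20, n_iter=100,
                       lb=-5.0, ub=5.0, seed=0):
    """Minimal Grey Wolf Optimizer sketch: minimizes `fitness` over the
    box [lb, ub]^dim by pulling each wolf toward the three best
    solutions (alpha, beta, delta)."""
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lb, ub, size=(n_wolves, dim))
    for t in range(n_iter):
        # Rank wolves: alpha, beta, delta are the three best solutions.
        order = np.argsort([fitness(w) for w in wolves])
        alpha, beta, delta = wolves[order[:3]]
        a = 2.0 - 2.0 * t / n_iter  # exploration factor decays 2 -> 0
        for i in range(n_wolves):
            candidates = []
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2.0 * a * r1 - a          # attack coefficient
                C = 2.0 * r2                  # prey-weighting coefficient
                D = np.abs(C * leader - wolves[i])
                candidates.append(leader - A * D)
            wolves[i] = np.clip(np.mean(candidates, axis=0), lb, ub)
    best = min(wolves, key=fitness)
    return best, fitness(best)

# Toy objective: sphere function, global minimum 0 at the origin.
best, val = grey_wolf_optimize(lambda x: float(np.sum(x**2)), dim=3)
print(val)  # converges close to 0
```

In a feature-selection setting like the paper's, the "position" vector would instead encode feature weights or a selection mask, and `fitness` would score downstream classification accuracy.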

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/e076afcf94b4/peerj-cs-11-2796-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/e46dac7a119d/peerj-cs-11-2796-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/4e7b4dba3a44/peerj-cs-11-2796-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/49f0ae92f52e/peerj-cs-11-2796-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/437874a7b829/peerj-cs-11-2796-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76e/12190338/bdc3131164fa/peerj-cs-11-2796-g005.jpg

Similar Articles

1. MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer.
PeerJ Comput Sci. 2025 May 23;11:e2796. doi: 10.7717/peerj-cs.2796. eCollection 2025.
2. A deep learning approach to direct immunofluorescence pattern recognition in autoimmune bullous diseases.
Br J Dermatol. 2024 Jul 16;191(2):261-266. doi: 10.1093/bjd/ljae142.
3. Advancing respiratory disease diagnosis: A deep learning and vision transformer-based approach with a novel X-ray dataset.
Comput Biol Med. 2025 Aug;194:110501. doi: 10.1016/j.compbiomed.2025.110501. Epub 2025 Jun 9.
4. Exploring the Potential of Electroencephalography Signal-Based Image Generation Using Diffusion Models: Integrative Framework Combining Mixed Methods and Multimodal Analysis.
JMIR Med Inform. 2025 Jun 25;13:e72027. doi: 10.2196/72027.
5. A Deep Neural Network Framework for Dynamic Two-Handed Indian Sign Language Recognition in Hearing and Speech-Impaired Communities.
Sensors (Basel). 2025 Jun 11;25(12):3652. doi: 10.3390/s25123652.
6. Trajectory-Ordered Objectives for Self-Supervised Representation Learning of Temporal Healthcare Data Using Transformers: Model Development and Evaluation Study.
JMIR Med Inform. 2025 Jun 4;13:e68138. doi: 10.2196/68138.
7. Stabilizing machine learning for reproducible and explainable results: A novel validation approach to subject-specific insights.
Comput Methods Programs Biomed. 2025 Jun 21;269:108899. doi: 10.1016/j.cmpb.2025.108899.
8. ICAFormer: An Image Dehazing Transformer Based on Interactive Channel Attention.
Sensors (Basel). 2025 Jun 15;25(12):3750. doi: 10.3390/s25123750.
9. Recognizing American Sign Language gestures efficiently and accurately using a hybrid transformer model.
Sci Rep. 2025 Jun 23;15(1):20253. doi: 10.1038/s41598-025-06344-8.
10. TLTNet: A novel transscale cascade layered transformer network for enhanced retinal blood vessel segmentation.
Comput Biol Med. 2024 Aug;178:108773. doi: 10.1016/j.compbiomed.2024.108773. Epub 2024 Jun 25.
