

P2T: Pyramid Pooling Transformer for Scene Understanding

Authors

Wu Yu-Huan, Liu Yun, Zhan Xin, Cheng Ming-Ming

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12760-12771. doi: 10.1109/TPAMI.2022.3202765. Epub 2023 Oct 3.

DOI: 10.1109/TPAMI.2022.3202765
PMID: 36040936
Abstract

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugged with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied P2T as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.

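The pooling-based MHSA the abstract describes (shorten the key/value sequence with multi-scale pooling while queries keep full resolution) can be illustrated with a minimal single-head NumPy sketch. This is a hypothetical simplification, not the authors' implementation: it omits the learned Q/K/V projections, depthwise convolutions, and multi-head split of the real P2T, and the pooling ratios and tensor sizes are illustrative only.

```python
import numpy as np

def avg_pool2d(x, k):
    # x: (H, W, C); non-overlapping average pooling with window/stride k
    H, W, C = x.shape
    Hp, Wp = H // k, W // k
    return x[:Hp * k, :Wp * k].reshape(Hp, k, Wp, k, C).mean(axis=(1, 3))

def pyramid_pool_attention(x, pool_ratios=(2, 4, 8)):
    # x: (H, W, C) feature map; single head, no learned projections
    H, W, C = x.shape
    q = x.reshape(H * W, C)  # queries keep the full sequence length
    # Key/value tokens come from a pyramid of pooled maps, so the
    # attended sequence is much shorter than H*W (here 16+4+1 vs 64).
    kv = np.concatenate(
        [avg_pool2d(x, r).reshape(-1, C) for r in pool_ratios], axis=0)
    attn = q @ kv.T / np.sqrt(C)                 # (H*W, L_kv), L_kv << H*W
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over pooled tokens
    return (attn @ kv).reshape(H, W, C)

out = pyramid_pool_attention(np.random.rand(8, 8, 16))
print(out.shape)  # (8, 8, 16)
```

The key point the sketch captures is that attention cost scales with the pooled token count rather than the full token count, while the multi-scale pyramid preserves context at several granularities instead of a single pooled resolution.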

Similar Articles

1. P2T: Pyramid Pooling Transformer for Scene Understanding.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12760-12771. doi: 10.1109/TPAMI.2022.3202765. Epub 2023 Oct 3.
2. PPNet: Pyramid pooling based network for polyp segmentation.
   Comput Biol Med. 2023 Jun;160:107028. doi: 10.1016/j.compbiomed.2023.107028. Epub 2023 May 10.
3. DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation.
   Neural Netw. 2024 Dec;180:106653. doi: 10.1016/j.neunet.2024.106653. Epub 2024 Aug 22.
4. Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network With Token Migration.
   IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):9766-9779. doi: 10.1109/TPAMI.2024.3429508. Epub 2024 Nov 6.
5. Vicinity Vision Transformer.
   IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12635-12649. doi: 10.1109/TPAMI.2023.3285569. Epub 2023 Sep 5.
6. Fully transformer network for skin lesion analysis.
   Med Image Anal. 2022 Apr;77:102357. doi: 10.1016/j.media.2022.102357. Epub 2022 Jan 18.
7. Multi-tailed vision transformer for efficient inference.
   Neural Netw. 2024 Jun;174:106235. doi: 10.1016/j.neunet.2024.106235. Epub 2024 Mar 14.
8. Dynamic Graph Message Passing Networks.
   IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):5712-5730. doi: 10.1109/TPAMI.2022.3207500. Epub 2023 Apr 3.
9. PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention.
   Sensors (Basel). 2023 Mar 25;23(7):3447. doi: 10.3390/s23073447.
10. Contextual Transformer Networks for Visual Recognition.
    IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1489-1500. doi: 10.1109/TPAMI.2022.3164083. Epub 2023 Jan 6.

Cited By

1. WMSA-WBS: Efficient Wave Multi-Head Self-Attention with Wavelet Bottleneck.
   Sensors (Basel). 2025 Aug 14;25(16):5046. doi: 10.3390/s25165046.
2. CMSAF-Net: integrative network design with enhanced decoder for precision segmentation of pear leaf diseases.
   Plant Methods. 2025 May 30;21(1):74. doi: 10.1186/s13007-025-01392-7.
3. CASF-Net: Underwater Image Enhancement with Color Correction and Spatial Fusion.
   Sensors (Basel). 2025 Apr 18;25(8):2574. doi: 10.3390/s25082574.
4. Linguistic-visual based multimodal Yi character recognition.
   Sci Rep. 2025 Apr 7;15(1):11874. doi: 10.1038/s41598-025-96397-6.
5. [A multi-scale supervision and residual feedback optimization algorithm for improving optic chiasm and optic nerve segmentation accuracy in nasopharyngeal carcinoma CT images].
   Nan Fang Yi Ke Da Xue Xue Bao. 2025 Mar 20;45(3):632-642. doi: 10.12122/j.issn.1673-4254.2025.03.21.
6. A robust deep learning approach for segmenting cortical and trabecular bone from 3D high resolution µCT scans of mouse bone.
   Sci Rep. 2025 Mar 13;15(1):8656. doi: 10.1038/s41598-025-92954-1.
7. PPLA-Transformer: An Efficient Transformer for Defect Detection with Linear Attention Based on Pyramid Pooling.
   Sensors (Basel). 2025 Jan 30;25(3):828. doi: 10.3390/s25030828.
8. Multi-modal remote perception learning for object sensory data.
   Front Neurorobot. 2024 Sep 19;18:1427786. doi: 10.3389/fnbot.2024.1427786. eCollection 2024.
9. CTHNet: a network for wheat ear counting with local-global features fusion based on hybrid architecture.
   Front Plant Sci. 2024 Jul 2;15:1425131. doi: 10.3389/fpls.2024.1425131. eCollection 2024.
10. Multi-scale window transformer for cervical cytopathology image recognition.
    Comput Struct Biotechnol J. 2024 Apr 16;24:314-321. doi: 10.1016/j.csbj.2024.04.028. eCollection 2024 Dec.