Liu Ruiyang, Li Yinghui, Tao Linmi, Liang Dun, Zheng Hai-Tao
Department of Computer Science and Technology, BNRist, Tsinghua University & Key Lab of Pervasive Computing, Ministry of Education of China, Beijing 100084, China.
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China.
Patterns (N Y). 2022 Jul 8;3(7):100520. doi: 10.1016/j.patter.2022.100520.
Recently proposed deep multilayer perceptron (MLP) models have attracted considerable interest in the vision community. Historically, the availability of larger datasets combined with increased computing capacity has driven paradigm shifts. This review discusses in detail whether MLPs can become a new paradigm for computer vision. We compare the intrinsic connections and differences among convolution, the self-attention mechanism, and the token-mixing MLP, outline the advantages and limitations of token-mixing MLPs, and then carefully analyze recent MLP-like variants, from module design to network architecture, together with their applications. In the graphics processing unit era, locally and globally weighted summations are the current mainstream, as represented by convolution, the self-attention mechanism, and MLPs. We suggest that further development of this paradigm be considered alongside next-generation computing devices.
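To make the comparison concrete, the token-mixing MLP discussed in the abstract can be sketched as a minimal MLP-Mixer-style block: one MLP mixes information across the token (spatial) axis, a second mixes across the channel axis, each with a residual connection. This is an illustrative NumPy sketch only; layer normalization is omitted, and all weight shapes and names are assumptions for the example, not the reviewed models' actual parameters.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    # two-layer perceptron applied along the last axis
    return gelu(x @ w1) @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # token mixing: transpose so the MLP acts across tokens, then transpose back
    x = x + mlp(x.T, tok_w1, tok_w2).T
    # channel mixing: the MLP acts across channels of each token independently
    x = x + mlp(x, ch_w1, ch_w2)
    return x

# toy dimensions (hypothetical): 16 tokens, 8 channels, hidden width 32
rng = np.random.default_rng(0)
n_tokens, channels, hidden = 16, 8, 32
x = rng.standard_normal((n_tokens, channels))
tok_w1 = rng.standard_normal((n_tokens, hidden)) * 0.02
tok_w2 = rng.standard_normal((hidden, n_tokens)) * 0.02
ch_w1 = rng.standard_normal((channels, hidden)) * 0.02
ch_w2 = rng.standard_normal((hidden, channels)) * 0.02
y = mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2)
print(y.shape)  # (16, 8): the block preserves the token-by-channel shape
```

Unlike convolution (local weighted summation) or self-attention (input-dependent global weighting), the token-mixing weights here are fixed after training and shared across channels, which is the trade-off the review examines.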