Ma Xiaolong, Lin Sheng, Ye Shaokai, He Zhezhi, Zhang Linfeng, Yuan Geng, Tan Sia Huat, Li Zhengang, Fan Deliang, Qian Xuehai, Lin Xue, Ma Kaisheng, Wang Yanzhi
IEEE Trans Neural Netw Learn Syst. 2022 Sep;33(9):4930-4944. doi: 10.1109/TNNLS.2021.3063265. Epub 2022 Aug 31.
Large deep neural network (DNN) models pose a key challenge to energy efficiency because off-chip DRAM accesses consume significantly more energy than arithmetic or SRAM operations. This motivates intensive research on model compression, with two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured manner, which offers higher flexibility and pruning rate but incurs index accesses due to irregular weight locations, or in a structured manner, which preserves the full matrix structure at a lower pruning rate. Weight quantization leverages the redundancy in the number of bits per weight. Compared to pruning, quantization is much more hardware-friendly and has become a "must-do" step for FPGA and ASIC implementations. Thus, any evaluation of the effectiveness of pruning should be performed on top of quantization. The key open question is: with quantization, which kind of pruning (non-structured versus structured) is more beneficial? This question is fundamental because the answer determines the design aspects we should really focus on to avoid the diminishing return of certain optimizations. This article provides a definitive answer to the question for the first time. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework, with algorithmic support for structured pruning, dynamic ADMM regulation, and masked mapping and retraining. Second, we develop a methodology for a fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: 1) it achieves 348×, 36×, and 8× overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss and 2) we demonstrate, for the first time, that fully binarized (all-layer) DNNs can be lossless in accuracy in many cases. These results provide a strong baseline and lend credibility to our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structured pruning is not competitive in terms of either storage or computation efficiency. Thus, we conclude that structured pruning has greater potential than non-structured pruning. We encourage the community to focus on studying DNN inference acceleration with structured sparsity.
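To make the pruning-granularity distinction concrete, the following is a minimal NumPy sketch contrasting the two approaches compared in the abstract. It is not the authors' ADMM-NN-S implementation; the function names, the per-weight magnitude criterion, and the L1-norm filter criterion are illustrative assumptions only.

```python
import numpy as np

def nonstructured_prune(weights, prune_rate):
    """Zero out the smallest-magnitude individual weights (irregular sparsity).
    Surviving weights must be stored with per-element indices (e.g., CSR/COO),
    which is the index-access overhead the abstract refers to."""
    flat = np.abs(weights).ravel()
    k = int(prune_rate * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def structured_prune_filters(weights, prune_rate):
    """Remove whole output filters of a conv layer (shape: out_ch, in_ch, kh, kw).
    The remaining tensor stays dense and regular, so no per-weight indices are
    needed -- the hardware-friendly case the paper argues for."""
    out_ch = weights.shape[0]
    k = int(prune_rate * out_ch)
    norms = np.abs(weights).reshape(out_ch, -1).sum(axis=1)  # L1 norm per filter
    keep = np.argsort(norms)[k:]                              # keep the largest filters
    mask = np.zeros(out_ch, dtype=bool)
    mask[keep] = True
    return weights[mask], mask

# Illustrative usage on a hypothetical 64-filter conv layer:
w = np.random.randn(64, 3, 3, 3)
w_ns, m_ns = nonstructured_prune(w, 0.9)        # ~10x irregular sparsity
w_s, m_s = structured_prune_filters(w, 0.5)     # drop half of the filters
```

In masked retraining, as referenced in the abstract's "masked mapping and retraining", a Boolean mask like the one returned above is typically frozen and used to zero the gradients of pruned positions, so the surviving weights can be fine-tuned to recover accuracy.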