School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Anhui Mine IOT and Security Monitoring Technology Key Laboratory, Hefei 230088, China.
Neural Netw. 2024 Nov;179:106533. doi: 10.1016/j.neunet.2024.106533. Epub 2024 Jul 17.
The increasing size of pre-trained language models has led to growing interest in model compression. Pruning and distillation are the primary methods used to compress these models. Existing pruning and distillation methods are effective at preserving model accuracy while reducing model size, but they have limitations. For instance, pruning is often suboptimal and biased when the discrete pruning decision is relaxed into a continuous optimization problem. Distillation relies primarily on one-to-one layer mappings for knowledge transfer, which underutilizes the rich knowledge in the teacher. We therefore propose a joint pruning and distillation method for the automatic pruning of pre-trained language models. Specifically, we first propose Gradient Progressive Pruning (GPP), which achieves a smooth transition of indicator vector values from real-valued to binary by progressively converging the indicator values of unimportant units to zero before the end of the search phase. This effectively overcomes the limitations of traditional pruning methods while supporting compression at higher sparsity. In addition, we propose Dual Feature Distillation (DFD). DFD adaptively fuses teacher features globally and student features locally, and then uses the resulting dual features, the global teacher features and local student features, for knowledge distillation. This realizes a "preview-review" mechanism that better extracts useful information from multi-level teacher representations and transfers it to the student. Comparative experiments on the GLUE benchmark and ablation studies show that our method outperforms other state-of-the-art methods.
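To make the GPP idea concrete, below is a minimal sketch of the progressive indicator-value schedule described in the abstract: a real-valued indicator vector whose unimportant entries are driven smoothly to zero over the search phase, so the mask becomes binary without an abrupt rounding step. All names (progressive_shrink, indicator, importance, keep_ratio) and the exact schedule are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a GPP-style progressive indicator schedule.
import numpy as np

def progressive_shrink(indicator, importance, step, total_steps, keep_ratio=0.5):
    """Gradually push indicator values of unimportant units toward zero so the
    real-valued indicator becomes (near-)binary by the end of the search phase."""
    progress = min(step / total_steps, 1.0)          # fraction of search completed
    n_keep = max(1, int(keep_ratio * len(indicator)))
    threshold = np.sort(importance)[::-1][n_keep - 1]
    unimportant = importance < threshold             # units ranked below the keep set
    indicator = indicator.copy()
    # Unimportant indicators shrink smoothly and reach exactly 0 at progress = 1;
    # important indicators are nudged toward 1 over the same schedule.
    indicator[unimportant] *= (1.0 - progress)
    indicator[~unimportant] += progress * (1.0 - indicator[~unimportant])
    return indicator

# Toy usage: 8 prunable units with random importance scores.
rng = np.random.default_rng(0)
ind = np.full(8, 0.5)                                # real-valued indicator vector
imp = rng.random(8)                                  # per-unit importance estimates
for t in range(1, 11):
    ind = progressive_shrink(ind, imp, step=t, total_steps=10)
print(np.round(ind, 3))                              # near-binary mask after the search phase
```

In this sketch the transition from real to binary is gradual rather than a one-shot thresholding, which is the property the abstract attributes to GPP; the actual importance estimation and convergence rule in the paper may differ.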