
DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT.

Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Anhui Mine IOT and Security Monitoring Technology Key Laboratory, Hefei 230088, China.

Publication Information

Neural Netw. 2024 May;173:106164. doi: 10.1016/j.neunet.2024.106164. Epub 2024 Feb 9.

Abstract

Large-scale pre-trained models, such as BERT, have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the large number of parameters in these models increases the demand for storage and computational resources and poses a challenge to their practical deployment. In this article, we propose a method that combines model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce DDK, a dynamic structure pruning method based on differentiable search and recursive knowledge distillation that automatically prunes the BERT model. We define the search space for network pruning as all feed-forward channels and self-attention heads at each layer of the network, and use a differentiable method to determine their optimal numbers. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the learning of the student network. Experimental results on the GLUE benchmark and an ablation analysis demonstrate that the proposed method outperforms other advanced methods in terms of average performance.
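The abstract describes DDK only at a high level. The sketch below illustrates, under stated assumptions, the two ingredients it names: differentiable gates over self-attention heads and feed-forward channels, and adaptive-weighted fusion of teacher intermediate layers for distillation. The sigmoid gate relaxation, the MSE matching loss, and all identifiers (DifferentiableGates, adaptive_fusion_distillation, fusion_logits) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableGates(nn.Module):
    """Learnable architecture parameters that softly mask the self-attention
    heads and feed-forward channels of one transformer layer.

    Hypothetical sketch: the paper searches the number of heads and FFN
    channels per layer with a differentiable relaxation; a sigmoid gate is
    used here as a stand-in for that relaxation.
    """

    def __init__(self, num_heads: int, ffn_channels: int):
        super().__init__()
        # One architecture parameter per head and per FFN channel.
        self.head_logits = nn.Parameter(torch.zeros(num_heads))
        self.ffn_logits = nn.Parameter(torch.zeros(ffn_channels))

    def forward(self, temperature: float = 1.0):
        # Soft gates in (0, 1); structures with small gates would be pruned
        # after the search (the selection rule itself is an assumption).
        head_gates = torch.sigmoid(self.head_logits / temperature)
        ffn_gates = torch.sigmoid(self.ffn_logits / temperature)
        return head_gates, ffn_gates


def adaptive_fusion_distillation(student_hidden, teacher_hiddens, fusion_logits):
    """Fuse several teacher intermediate layers with learned (adaptive) weights
    and match one student representation to the fused target via MSE.

    student_hidden:  (batch, seq, dim) hidden state of one student layer.
    teacher_hiddens: list of (batch, seq, dim) teacher intermediate states,
                     assumed to be computed under torch.no_grad() (frozen teacher).
    fusion_logits:   learnable tensor of shape (len(teacher_hiddens),).
    """
    weights = F.softmax(fusion_logits, dim=0)                 # adaptive weights
    fused = sum(w * h for w, h in zip(weights, teacher_hiddens))
    return F.mse_loss(student_hidden, fused)


if __name__ == "__main__":
    # Toy BERT-base-like shapes: 12 heads, 3072 FFN channels, hidden size 768.
    gates = DifferentiableGates(num_heads=12, ffn_channels=3072)
    head_gates, ffn_gates = gates()
    print(head_gates.shape, ffn_gates.shape)  # torch.Size([12]) torch.Size([3072])

    student = torch.randn(2, 16, 768)
    teachers = [torch.randn(2, 16, 768) for _ in range(4)]
    fusion_logits = nn.Parameter(torch.zeros(4))
    loss = adaptive_fusion_distillation(student, teachers, fusion_logits)
    print(loss.item())
```

In a full pipeline these two pieces would be trained jointly, with heads and channels whose gates fall below a threshold removed after the search; that thresholding step is likewise an assumption, as the abstract does not state the selection rule.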

