DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT.

Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Anhui Mine IOT and Security Monitoring Technology Key Laboratory, Hefei 230088, China.

Publication Information

Neural Netw. 2024 May;173:106164. doi: 10.1016/j.neunet.2024.106164. Epub 2024 Feb 9.

DOI: 10.1016/j.neunet.2024.106164
PMID: 38367353
Abstract

Large-scale pre-trained models, such as BERT, have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the high number of parameters in these models has increased the demand for hardware storage and computational resources while posing a challenge for their practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce a dynamic structure pruning method based on differentiable search and recursive knowledge distillation to automatically prune the BERT model, named DDK. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and utilize differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the student network learning. Our experimental results on the GLUE benchmark dataset and ablation analysis demonstrate that our proposed method outperforms other advanced methods in terms of average performance.
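The abstract describes two coupled mechanisms: a differentiable search over feed-forward channels and self-attention heads, and an adaptively weighted distillation loss over multiple intermediate teacher layers. The sketch below is a minimal PyTorch illustration of those two ideas, not the authors' implementation: the names (`DifferentiableGates`, `adaptive_layer_distillation`), the sigmoid-gate parameterization, and the per-layer MSE objective are all assumptions, and the recursive scheduling of DDK's distillation is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableGates(nn.Module):
    """Relaxed, learnable gates over self-attention heads and FFN channels.

    Minimal sketch of the differentiable-search idea from the abstract:
    every prunable unit in a transformer layer gets an architecture
    parameter, and a sigmoid relaxation lets gradient descent decide which
    units survive. The parameterization is an assumption; the paper's exact
    search procedure may differ.
    """

    def __init__(self, num_heads: int, ffn_channels: int):
        super().__init__()
        self.head_alpha = nn.Parameter(torch.zeros(num_heads))
        self.ffn_alpha = nn.Parameter(torch.zeros(ffn_channels))

    def head_mask(self, temperature: float = 1.0) -> torch.Tensor:
        # Soft [0, 1] gate per attention head; after the search phase a
        # threshold (e.g. > 0.5) hardens the mask into a pruned structure.
        return torch.sigmoid(self.head_alpha / temperature)

    def ffn_mask(self, temperature: float = 1.0) -> torch.Tensor:
        # Same relaxation for feed-forward channels.
        return torch.sigmoid(self.ffn_alpha / temperature)


def adaptive_layer_distillation(student_feats, teacher_feats, weight_logits):
    """Adaptively weighted multi-layer feature distillation.

    `weight_logits` is a learnable vector with one logit per teacher layer;
    its softmax decides how much each intermediate layer contributes,
    approximating the abstract's "adaptive weighting" of teacher features.
    """
    weights = F.softmax(weight_logits, dim=0)
    loss = torch.zeros((), device=weight_logits.device)
    for w, s, t in zip(weights, student_feats, teacher_feats):
        # Hidden-state regression per layer, weighted by its learned share.
        loss = loss + w * F.mse_loss(s, t)
    return loss


# Usage sketch: during the search phase the masks would multiply head outputs
# and FFN activations inside each BERT layer, and the distillation loss is
# added to the task loss before backpropagating through weights and alphas.
gates = DifferentiableGates(num_heads=12, ffn_channels=3072)
feats_s = [torch.randn(2, 8, 768) for _ in range(4)]
feats_t = [torch.randn(2, 8, 768) for _ in range(4)]
layer_logits = nn.Parameter(torch.zeros(4))
kd_loss = adaptive_layer_distillation(feats_s, feats_t, layer_logits)
```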

Similar Articles

1. DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT.
Neural Netw. 2024 May;173:106164. doi: 10.1016/j.neunet.2024.106164. Epub 2024 Feb 9.

2. Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT compression.
Neural Netw. 2024 Nov;179:106533. doi: 10.1016/j.neunet.2024.106533. Epub 2024 Jul 17.

3. LAD: Layer-Wise Adaptive Distillation for BERT Model Compression.
Sensors (Basel). 2023 Jan 28;23(3):1483. doi: 10.3390/s23031483.

4. AUBER: Automated BERT regularization.
PLoS One. 2021 Jun 28;16(6):e0253241. doi: 10.1371/journal.pone.0253241. eCollection 2021.

5. DMPP: Differentiable multi-pruner and predictor for neural network pruning.
Neural Netw. 2022 Mar;147:103-112. doi: 10.1016/j.neunet.2021.12.020. Epub 2021 Dec 30.

6. Knowledge distillation based on multi-layer fusion features.
PLoS One. 2023 Aug 28;18(8):e0285901. doi: 10.1371/journal.pone.0285901. eCollection 2023.

7. BERTtoCNN: Similarity-preserving enhanced knowledge distillation for stance detection.
PLoS One. 2021 Sep 10;16(9):e0257130. doi: 10.1371/journal.pone.0257130. eCollection 2021.

8. Improving Differentiable Architecture Search via self-distillation.
Neural Netw. 2023 Oct;167:656-667. doi: 10.1016/j.neunet.2023.08.062. Epub 2023 Sep 9.

9. Leveraging different learning styles for improved knowledge distillation in biomedical imaging.
Comput Biol Med. 2024 Jan;168:107764. doi: 10.1016/j.compbiomed.2023.107764. Epub 2023 Nov 30.

10. Knowledge Fusion Distillation: Improving Distillation with Multi-scale Attention Mechanisms.
Neural Process Lett. 2023 Jan 3:1-16. doi: 10.1007/s11063-022-11132-w.

Cited By

1. Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization.
Entropy (Basel). 2025 Apr 2;27(4):379. doi: 10.3390/e27040379.