Belhaouari Samir Brahim, Kraidia Insaf
Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Ar-Rayyan, Qatar.
Faculty of Information Technology, Department of Networks and Cybersecurity, Al-Ahliyya Amman University, Amman, Jordan.
Sci Rep. 2025 Mar 24;15(1):10171. doi: 10.1038/s41598-025-92586-5.
Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands carry significant environmental costs, particularly in energy and water consumption. This paper addresses these issues by proposing an innovative compression approach for reducing LLM size. We focus on compressing the internal transformer layers, which are the critical contributors to LLMs' computational complexity. Our approach combines novel mathematical and structural methods for model compression. First, we apply Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers, using a weight freezing and zeroing technique on parameters suspected to be unused. This reduces the number of trainable parameters, accelerating training and enabling faster convergence. Second, we introduce Weight Matrix Folding, a simple and efficient mathematical scheme for pruning the self-attention matrices. It combines Identical Row Compression (IRC), which compresses the Query and Key matrices, with Diagonal Weight Compression (DWC), which reformulates the Value matrix into a diagonal structure. This significantly reduces parameter variability across the three matrices, improving consistency and performance while lowering complexity. The compression approach is evaluated on three language modeling datasets and eight widely used classification datasets, and compared against various pruning methods. Our method compresses the transformer layers by 99% and the linear layers by 70%, yielding an overall model compression of around 70% while maintaining nearly the same accuracy. Notably, at moderate compression rates of 20% to 40%, model performance not only remained stable but even improved.
This leads to substantial reductions in memory usage and computational demands, making LLMs more resource-efficient and highlighting the potential to optimize them for a more sustainable AI future.
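To make the three mechanisms named in the abstract concrete, the following is a minimal NumPy sketch of the general ideas only: the thresholding rule for FPP, the row-rounding criterion for IRC, and all function names are illustrative assumptions, not the paper's exact procedures.

```python
import numpy as np

def fpp_zero_and_mask(W, threshold):
    """FPP-style step (illustrative): zero weights whose magnitude falls below
    a threshold and return a freeze mask so the zeroed entries can be excluded
    from gradient updates. The threshold rule is an assumed stand-in for the
    paper's criterion for 'suspected unused parameters'."""
    mask = np.abs(W) >= threshold      # True where the weight stays trainable
    return W * mask, mask

def dwc_diagonalize(V):
    """DWC (illustrative): keep only the diagonal of a square Value matrix,
    storing d parameters instead of d*d."""
    return np.diag(np.diag(V))

def irc_fold_identical_rows(M, decimals=6):
    """IRC (illustrative): store each distinct row of a Query/Key matrix once,
    plus an index vector that reconstructs the original matrix."""
    rounded = np.round(M, decimals)
    unique_rows, index = np.unique(rounded, axis=0, return_inverse=True)
    return unique_rows, index

# Toy demonstration on small matrices
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(4, 4))
W_pruned, mask = fpp_zero_and_mask(W, threshold=0.05)

V = rng.normal(size=(4, 4))
V_diag = dwc_diagonalize(V)            # off-diagonal entries are all zero

M = np.vstack([np.ones(3), np.zeros(3), np.ones(3)])  # rows 0 and 2 identical
rows, idx = irc_fold_identical_rows(M)  # 2 stored rows; rows[idx] rebuilds M
```

In each case the saving is structural: FPP shrinks the trainable-parameter count via the mask, DWC stores O(d) instead of O(d²) values, and IRC stores only the distinct rows plus a small index vector.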