Guo Yufei, Hao Zecheng, Shao Jiahang, Zhou Jie, Liu Xiaode, Tong Xin, Zhang Yuhan, Chen Yuanpei, Peng Weihang, Ma Zhe
Intelligent Science & Technology Academy of CASIC, China.
School of Computer Science, Peking University, China.
Neural Netw. 2025 Nov;191:107855. doi: 10.1016/j.neunet.2025.107855. Epub 2025 Jul 9.
The deployment of Large Language Models (LLMs) has been constrained by their substantial hardware requirements and associated costs. Quantization techniques have emerged as a promising solution to address these challenges. Recently, BitNet [Wang et al., 2023] proposed using ternary values (+1, 0, -1) for weight quantization, showing particular promise in eliminating multiplication operations and thereby significantly reducing latency and energy consumption. However, BitNet's requirement of training models from scratch limits its scalability to models larger than 3 billion parameters. This paper introduces PT-BitNet, a novel post-training quantization method that extends the benefits of BitNet's ternary quantization to large-scale language models with up to 70B parameters. To effectively quantize the model parameters down to {+1, 0, -1}, we propose a two-stage algorithm. In the first stage, we transform the weight distribution into a quantization-friendly one; in the second stage, we optimize the weight elements in a block-wise manner. We demonstrate the effectiveness of PT-BitNet through comprehensive experiments across model sizes and downstream tasks. Our results show that PT-BitNet achieves substantial reductions in model size and inference time with minimal impact on task performance. For example, PT-BitNet scales to a 70B-parameter LLM with 61% average downstream accuracy, significantly outperforming BitNet b1.58 with 51.2% average accuracy.
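To make the ternary-quantization setting concrete, the sketch below shows the absmean scheme used by BitNet b1.58: weights are scaled by their mean absolute value and rounded to {-1, 0, +1}. This is a minimal illustration of the quantization target only; PT-BitNet's two-stage algorithm (distribution transformation followed by block-wise weight optimization) is not reproduced here, and the function name is ours.

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Absmean ternary quantization (as in BitNet b1.58):
    scale by the mean absolute weight, then round each entry
    to the nearest value in {-1, 0, +1}."""
    gamma = np.abs(W).mean() + eps               # per-tensor scale
    W_ternary = np.clip(np.round(W / gamma), -1, 1)
    return W_ternary, gamma                      # W is approximated by gamma * W_ternary

# Toy example: a small weight matrix collapses to ternary values.
W = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.02, 0.7]])
Wq, g = ternary_quantize(W)
```

Because every quantized weight is -1, 0, or +1, matrix-vector products reduce to additions and subtractions of activations, which is the source of the latency and energy savings the abstract refers to.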