Huang Wei, Zheng Xingyu, Ma Xudong, Qin Haotong, Lv Chengtao, Chen Hong, Luo Jie, Qi Xiaojuan, Liu Xianglong, Magno Michele
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, 999077 China.
School of Computer Science and Engineering, Beihang University, Xueyuan Road, Beijing, 100191 China.
Vis Intell. 2024;2(1):36. doi: 10.1007/s44267-024-00070-x. Epub 2024 Dec 30.
The LLaMA family, a collection of foundation language models ranging from 7B to 65B parameters, has become one of the most powerful families of open-source large language models (LLMs) and a popular LLM backbone for multi-modal large language models (MLLMs), widely used in computer vision and natural language understanding tasks. In particular, the LLaMA3 models have recently been released and have achieved impressive performance in various domains with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-constrained scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration can potentially provide new insights and challenges for the low-bit quantization of LLaMA3 and other future LLMs, especially in addressing the performance degradation issues that arise in LLM compression. Specifically, we comprehensively evaluate 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods on LLaMA3 at 1-8 bits and on various datasets to reveal the low-bit quantization performance of LLaMA3. To uncover the capabilities of low-bit quantized MLLMs, we assess the performance of the LLaMA3-based LLaVA-Next-8B model at ultra-low bit-widths of 2-4 bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers from non-negligible degradation in linguistic and visual contexts, particularly at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be addressed in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit-widths to enhance practicality.
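To make the quantization setting concrete, the following is a minimal sketch of round-to-nearest (RTN) uniform post-training quantization of a weight matrix, one of the simplest baselines among the kinds of methods the study evaluates. The per-row (per-output-channel) asymmetric scheme and the function name `quantize_rtn` are illustrative assumptions, not the paper's specific configuration.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest uniform quantization, per output channel (row),
    asymmetric: map each row's range onto {0, ..., 2^bits - 1}, round,
    then dequantize. Returns the dequantized (lossy) weights."""
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
err4 = np.abs(quantize_rtn(w, 4) - w).mean()  # 4-bit reconstruction error
err2 = np.abs(quantize_rtn(w, 2) - w).mean()  # 2-bit reconstruction error
```

As the bit-width drops from 4 to 2, the quantization grid becomes coarser and the mean reconstruction error grows, which is the mechanism behind the degradation the abstract reports at ultra-low bit-widths.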