Enabling Energy-Efficient Deployment of Large Language Models on Memristor Crossbar: A Synergy of Large and Small.

Author Information

Wang Zhehui, Luo Tao, Liu Cheng, Liu Weichen, Goh Rick Siow Mong, Wong Weng-Fai

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2025 Feb;47(2):916-933. doi: 10.1109/TPAMI.2024.3483654. Epub 2025 Jan 9.

Abstract

Large language models (LLMs) have garnered substantial attention due to their promising applications in diverse domains. Nevertheless, the increasing size of LLMs comes with a significant surge in the computational requirements for training and deployment. Memristor crossbars have emerged as a promising solution, having demonstrated a small footprint and remarkably high energy efficiency in computer vision (CV) models. Memristors possess higher density than conventional memory technologies, making them highly suitable for managing the extreme model sizes associated with LLMs. However, deploying LLMs on memristor crossbars faces three major challenges. First, the size of LLMs is growing rapidly and already surpasses the capacity of state-of-the-art memristor chips. Second, LLMs often incorporate multi-head attention blocks, which involve non-weight-stationary multiplications that traditional memristor crossbars cannot support. Third, while memristor crossbars excel at linear operations, they cannot execute the complex nonlinear operations in LLMs, such as softmax and layer normalization. To address these challenges, we present a novel memristor crossbar architecture that enables the deployment of a state-of-the-art LLM on a single chip or package, eliminating the energy and time inefficiencies associated with off-chip communication. Our testing on BERT showed negligible accuracy loss. Compared to traditional memristor crossbars, our architecture achieves enhancements of up to in area overhead and in energy consumption. Compared to modern TPU/GPU systems, our architecture demonstrates at least a reduction in the area-delay product and a significant 69% reduction in energy consumption.
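To make the second and third challenges concrete, below is a minimal NumPy sketch of one attention head's data flow. This is our illustration, not the paper's architecture; the toy dimensions, variable names, and softmax helper are all assumptions. The point it shows: x @ W_q multiplies activations by fixed weights, which maps directly onto a weight-stationary crossbar, whereas Q @ K^T and the attention-weighted sum multiply two dynamic activation matrices, and softmax is nonlinear, so neither fits a plain analog crossbar.

import numpy as np

# Toy sizes, chosen arbitrarily for illustration.
d_model, n_heads, seq_len = 64, 4, 8
d_head = d_model // n_heads

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_model))  # fixed weights: programmable into
W_k = rng.normal(size=(d_model, d_model))  # crossbar conductances once and
W_v = rng.normal(size=(d_model, d_model))  # reused for every input token
x = rng.normal(size=(seq_len, d_model))    # activations: change per input

def softmax(z, axis=-1):
    # Nonlinear operation: needs digital (or lookup-table) support,
    # not the analog crossbar itself.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Weight-stationary step: x @ W is exactly the crossbar's analog
# vector-matrix multiply (inputs as voltages, weights as conductances).
q = (x @ W_q).reshape(seq_len, n_heads, d_head)
k = (x @ W_k).reshape(seq_len, n_heads, d_head)
v = (x @ W_v).reshape(seq_len, n_heads, d_head)

out = np.empty_like(q)
for h in range(n_heads):
    # Non-weight-stationary step: both operands are fresh activations, so a
    # plain crossbar would have to be reprogrammed with k for every sequence.
    scores = q[:, h] @ k[:, h].T / np.sqrt(d_head)
    # softmax plus a second activation-activation matmul.
    out[:, h] = softmax(scores) @ v[:, h]

print(out.reshape(seq_len, d_model).shape)  # (8, 64)

In this framing, the projection weights can stay resident in crossbar conductances, while the activation-activation products and the softmax/layer-normalization steps are precisely the parts that motivate the paper's added architectural support.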
