IBM Research-Almaden, San Jose, CA, USA.
IBM Research-Zurich, Rueschlikon, Switzerland.
Nature. 2018 Jun;558(7708):60-67. doi: 10.1038/s41586-018-0180-5. Epub 2018 Jun 6.
Neural-network training can be slow and energy intensive, owing to the need to transfer the weight data for the network between conventional digital memory chips and processor chips. Analogue non-volatile memory can accelerate the neural-network training algorithm known as backpropagation by performing parallelized multiply-accumulate operations in the analogue domain at the location of the weight data. However, the classification accuracies of such in situ training using non-volatile-memory hardware have generally been lower than those of software-based training, owing to insufficient dynamic range and excessive weight-update asymmetry. Here we demonstrate mixed hardware-software neural-network implementations that involve up to 204,900 synapses and that combine long-term storage in phase-change memory, near-linear updates of volatile capacitors and weight-data transfer with 'polarity inversion' to cancel out inherent device-to-device variations. We achieve generalization accuracies (on previously unseen data) equivalent to those of software-based training on various commonly used machine-learning test datasets (MNIST, MNIST-backrand, CIFAR-10 and CIFAR-100). The computational energy efficiency of 28,065 billion operations per second per watt and throughput per area of 3.6 trillion operations per second per square millimetre that we calculate for our implementation exceed those of today's graphics processing units by two orders of magnitude. This work provides a path towards hardware accelerators that are both fast and energy efficient, particularly on fully connected neural-network layers.
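The mechanism outlined in the abstract, a parallel multiply-accumulate performed at the location of the weight data, with each weight split between a long-term phase-change-memory (PCM) conductance pair and a near-linear volatile capacitor pair whose accumulated value is periodically transferred into the PCM with alternating sign, can be illustrated numerically. The following Python/NumPy model is a minimal sketch under assumed simplifications, not the authors' circuit: the names (GAIN_F, TRANSFER_EVERY, device_error), the toy dimensions and the additive fixed-error model of the transfer step are all hypothetical, chosen only to show how 'polarity inversion' cancels a fixed device-to-device transfer error over successive transfers.

```python
# Minimal sketch of the mixed analogue-weight scheme described in the abstract.
# All names, dimensions and the error model are illustrative assumptions, not
# the published implementation. Each weight is a more-significant PCM
# contribution G (scaled by GAIN_F) plus a less-significant, near-linear
# volatile capacitor contribution g; backpropagation updates touch only the
# capacitors, and their accumulated value is periodically transferred into the
# PCM with alternating sign ('polarity inversion') so that a fixed
# device-to-device transfer error cancels over successive transfers.
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_OUT = 8, 4        # toy layer dimensions (assumed)
GAIN_F = 3.0              # significance gain between PCM and capacitor pairs (assumed)
TRANSFER_EVERY = 100      # training steps between weight-data transfers (assumed)

G = np.zeros((N_OUT, N_IN))   # net PCM contribution (conductance pair G+ - G-)
g = np.zeros((N_OUT, N_IN))   # net capacitor contribution (pair g+ - g-)
polarity = 1.0                # sign currently applied to the capacitor pair
# fixed per-device transfer error: the device-to-device variation to be cancelled
device_error = 0.05 * rng.standard_normal((N_OUT, N_IN))

def weights():
    """Effective weight seen by the analogue multiply-accumulate."""
    return GAIN_F * G + polarity * g

def mac(x):
    """Parallel multiply-accumulate 'at the location of the weight data':
    physically Ohm's law plus current summation, numerically a mat-vec."""
    return weights() @ x

def train_step(dw):
    """Apply a backpropagation update to the volatile capacitors only."""
    global g
    g += polarity * dw        # the effective contribution polarity*g gains dw

def transfer():
    """Move the accumulated capacitor value into PCM, with a fixed error whose
    sign follows the current polarity; then invert the polarity so the next
    transfer's error enters with the opposite sign and cancels it."""
    global G, g, polarity
    G += (polarity * g) / GAIN_F + polarity * device_error
    g[:] = 0.0
    polarity = -polarity

# Toy run: random 'gradients', with an even number of transfers (4) so the
# alternating-sign transfer errors cancel exactly.
W_ideal = np.zeros((N_OUT, N_IN))
for step in range(1, 401):
    dw = 0.01 * rng.standard_normal((N_OUT, N_IN))
    train_step(dw)
    W_ideal += dw
    if step % TRANSFER_EVERY == 0:
        transfer()

print("max |W - W_ideal|:", np.max(np.abs(weights() - W_ideal)))  # ~1e-16
```

If the polarity flip at the end of transfer() is removed, the same run leaves a residual error of 4 * GAIN_F * device_error in the effective weights; with the flip, the four transfer errors enter as +, -, +, - and cancel, which is the cancellation the abstract attributes to polarity inversion.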