Zhou Yue, Wang Yasai, Zhuge Fuwei, Guo Jianmiao, Ma Sijie, Wang Jingli, Tang Zijian, Li Yi, Miao Xiangshui, He Yuhui, Chai Yang
Wuhan National Laboratory for Optoelectronics, School of Integrated Circuits, Huazhong University of Science and Technology, Wuhan, 430000, China.
Department of Applied Physics, The Hong Kong Polytechnic University, Hong Kong, 999077, China.
Adv Mater. 2022 Dec;34(48):e2107754. doi: 10.1002/adma.202107754. Epub 2022 Feb 25.
Reward-modulated spike-timing-dependent plasticity (R-STDP) is a brain-inspired reinforcement learning (RL) rule with potential for decision-making tasks and artificial general intelligence. However, hardware implementation of the reward-modulation process in R-STDP usually requires complicated Si complementary metal-oxide-semiconductor (CMOS) circuit design, which causes high power consumption and a large footprint. Here, a design with two synaptic transistors (2T) connected in a parallel structure is experimentally demonstrated. The 2T unit, based on WSe2 ferroelectric transistors, exhibits reconfigurable-polarity behavior: owing to nonvolatile ferroelectric polarization, one channel can be tuned n-type and the other p-type. In this way, opposite synaptic weight-update behaviors are realized, with multilevel (>6 bit) conductance states, ultralow nonlinearity (0.56/-1.23), and a large Gmax/Gmin ratio of 30. By applying a positive/negative reward to the (anti-)STDP component of the 2T cell, R-STDP learning rules are realized for training a spiking neural network and are demonstrated to solve the classical cart-pole problem, pointing a way toward low-power (32 pJ per forward process) and highly area-efficient (100 µm²) hardware chips for reinforcement learning.
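The reward-modulation idea the abstract describes can be illustrated with a minimal software sketch: an STDP term gives the sign and magnitude of the candidate weight change from the pre/post spike-timing difference, and a positive reward applies it as-is (STDP) while a negative reward reverses it (anti-STDP). The function names and all parameter values below are illustrative assumptions, not taken from the paper or its device measurements.

```python
import numpy as np

def stdp_term(dt, a_plus=0.1, a_minus=0.12, tau=20.0):
    """Pair-based STDP: potentiate if pre fires before post (dt > 0),
    depress otherwise. a_plus, a_minus, tau (ms) are illustrative values."""
    if dt > 0:
        return a_plus * np.exp(-dt / tau)
    return -a_minus * np.exp(dt / tau)

def r_stdp_update(w, dt, reward, lr=0.5):
    """Reward-modulated update: a positive reward reinforces the STDP change,
    a negative reward flips its sign (anti-STDP), loosely mirroring the
    complementary n-/p-type branches of the 2T cell described above."""
    return w + lr * reward * stdp_term(dt)

w = 0.5
w_pot = r_stdp_update(w, dt=5.0, reward=+1.0)   # pre-before-post, rewarded: weight grows
w_dep = r_stdp_update(w, dt=5.0, reward=-1.0)   # same timing, punished: weight shrinks
```

In the hardware version, this sign flip is not computed in CMOS: the reward polarity selects which of the two parallel ferroelectric transistor channels (n-type or p-type) dominates the conductance update.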