Li Meng, Cao Zehong, Li Zhibin
IEEE Trans Neural Netw Learn Syst. 2021 Dec;32(12):5309-5322. doi: 10.1109/TNNLS.2021.3071959. Epub 2021 Nov 30.
Vehicle platooning is expected to become the dominant driving mode on future roads. To the best of our knowledge, few reinforcement learning (RL) algorithms have been applied to vehicle platoon control, which involves large-scale action and state spaces. Most existing RL-based methods address single-agent problems; multiagent problems instead call for multiagent RL algorithms, since the parameter space grows exponentially with the number of agents involved. However, previous multiagent RL algorithms often supply agents with redundant information, i.e., a large amount of useless or unrelated information, which can hinder training convergence and the extraction of patterns from shared information. In addition, random exploratory actions frequently lead to crashes, especially at the beginning of training. In this study, a communication proximal policy optimization (CommPPO) algorithm is proposed to address these issues. Specifically, the CommPPO model adopts a parameter-sharing structure that allows the number of agents to vary dynamically, so it can handle various platoon dynamics, including splitting and merging. The communication protocol of CommPPO consists of two parts. In the state part, the widely used predecessor-leader follower topology is adopted to transmit global and local state information to agents. In the reward part, a new reward communication channel is proposed to resolve the spurious reward and "lazy agent" problems found in some existing multiagent RL methods. Moreover, a curriculum learning approach is adopted to reduce crashes and speed up training. To validate the proposed strategy for platoon control, two existing multiagent RL algorithms and a traditional platoon control strategy were applied in the same scenarios for comparison. Results show that the CommPPO algorithm gained more reward and achieved the largest fuel consumption reduction (11.6%).
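The parameter-sharing idea described in the abstract can be illustrated with a minimal sketch: a single set of policy parameters is applied to every follower, so the platoon can change size (merge or split) without changing the model. This is only an illustrative NumPy sketch; the observation layout, dimensions, and function names below are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed observation layout per follower (predecessor-leader follower
# information flow): own speed, gap to predecessor, predecessor speed,
# and platoon-leader speed.
OBS_DIM = 4
ACT_DIM = 1  # longitudinal acceleration command

# One shared parameter set: every follower evaluates the same linear
# policy, so no per-agent networks are needed.
W = rng.normal(0.0, 0.1, size=(OBS_DIM, ACT_DIM))
b = np.zeros(ACT_DIM)

def shared_policy(obs):
    """Mean action of the shared policy for one follower's observation."""
    return obs @ W + b

def platoon_actions(observations):
    """Apply the single shared policy to any number of followers."""
    return np.vstack([shared_policy(o) for o in observations])

# A 3-follower platoon and, after a hypothetical merge, a 5-follower
# platoon reuse the identical parameters W and b.
obs3 = rng.normal(size=(3, OBS_DIM))
obs5 = rng.normal(size=(5, OBS_DIM))
print(platoon_actions(obs3).shape)  # (3, 1)
print(platoon_actions(obs5).shape)  # (5, 1)
```

Because the parameter shapes depend only on the per-agent observation size, adding or removing followers changes nothing about the trained model, which is what enables the dynamic platoon sizes the abstract mentions.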