Wang Hui, Han Zhiwei, Liu Wenqiang, Wu Yanbo
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):5915-5928. doi: 10.1109/TNNLS.2022.3219814. Epub 2024 May 2.
In high-speed railways, the pantograph-catenary system (PCS) is a critical subsystem of the train power supply system. In particular, when the double-PCS (DPCS) is in operation, the passing of the leading pantograph (LP) causes the contact force of the trailing pantograph (TP) to fluctuate violently, degrading the current collection quality of the electric multiple units (EMUs). The actively controlled pantograph is the most promising technique for reducing the pantograph-catenary contact force (PCCF) fluctuation and improving the current collection quality. Based on the Nash equilibrium framework, this study proposes a multiagent reinforcement learning (MARL) algorithm for active pantograph control, called cooperative proximal policy optimization (Coo-PPO). In the algorithm implementation, heterogeneous agents play unique roles in a cooperative environment guided by a global value function. A novel reward propagation channel is then proposed to reveal implicit associations between the agents. Furthermore, a curriculum learning approach is adopted to strike a balance between reward maximization and rational movement patterns. The proposed control strategy is compared with an existing MARL algorithm and a traditional control strategy in the same scenario to validate its performance. The experimental results show that Coo-PPO obtains more rewards, significantly suppresses the fluctuation in PCCF (by up to 41.55%), and markedly decreases the TP's offline rate (by up to 10.77%). This study is the first to adopt MARL to address the coordinated control of double pantographs in the DPCS.
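The abstract's core idea, two heterogeneous PPO agents (one per pantograph) coupled through a shared global value function, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the dimensions, the GAE-based advantage, and the way the shared advantage stands in for the paper's "reward propagation channel" are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

def global_advantage(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation from a single, shared (global) value
    function -- both agents are credited against the same joint signal."""
    adv = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

# Toy rollout: a joint reward (e.g., penalizing PCCF fluctuation and TP
# offline events) and global value estimates; both arrays are synthetic here.
T = 8
rewards = rng.normal(size=T)
values = rng.normal(size=T)
adv = global_advantage(rewards, values)

# Each agent has its own policy (hence its own probability ratio), but both
# are updated against the same global advantage. This shared credit signal is
# a crude stand-in for the coupling the paper attributes to its reward
# propagation channel between the LP and TP agents.
loss_lp = ppo_clip_loss(rng.uniform(0.8, 1.2, size=T), adv)  # leading pantograph
loss_tp = ppo_clip_loss(rng.uniform(0.8, 1.2, size=T), adv)  # trailing pantograph
```

In a full system each loss would drive a gradient step on that agent's actor network, while the global value function is regressed on the joint returns; curriculum learning would then schedule progressively harder operating scenarios (e.g., higher train speeds).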