Zhao Bin, Wu Yao, Wu Chengdong, Sun Ruohuai
School of Information Science and Engineering, Northeastern University, Shenyang, 110819, China.
Liaoning University of Technology College of Interdisciplinary Sciences, Jinzhou, 121001, China.
Sci Rep. 2025 Mar 10;15(1):8286. doi: 10.1038/s41598-025-93175-2.
The paper proposes a new M2ACD (Multi-Actor-Critic Deep Deterministic Policy Gradient) algorithm for trajectory planning of robotic manipulators in complex environments. First, the paper presents a general inverse kinematics algorithm that transforms the inverse kinematics problem into a general Newton-MP iterative method. The M2ACD algorithm is then structured around multiple actor and critic networks. The dual-actor network reduces overestimation of action values, minimizes the correlation between the actor and value networks, and mitigates instability in the actor's action selection caused by excessively high Q-values. The dual-critic network reduces the estimation bias of Q-values, ensuring more reliable action selection and enhancing the stability of Q-value estimation. Second, a TSR (two-stage reward) strategy for the robotic manipulator is designed, divided into approach and close phases: rewards in the approach phase focus on safely and efficiently approaching the target, while rewards in the close phase govern the final adjustments before contact with the target. Third, to solve the position-hopping jitter problem in traditional reinforcement-learning trajectory planning, a NURBS (Non-Uniform Rational B-Splines) curve is used to smooth the hopping trajectory generated by M2ACD. Finally, the correctness of M2ACD and the kinematics algorithm is verified by experiments. The M2ACD algorithm demonstrates superior curve smoothness, convergence stability, and convergence speed compared with the TD3, DARC, and DDPG algorithms. M2ACD can be effectively applied to trajectory planning for collaborative robots, establishing a foundation for subsequent research.
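The Newton-style iterative inverse kinematics described above can be sketched as follows, assuming "MP" denotes the Moore-Penrose pseudoinverse of the manipulator Jacobian. The 2-link planar arm, link lengths, start configuration, and tolerances below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Illustrative 2-link planar arm (link lengths are assumptions).
L1, L2 = 1.0, 1.0

def fk(q):
    """Forward kinematics: joint angles -> end-effector position (x, y)."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q):
    """Analytic 2x2 Jacobian of fk with respect to the joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-L1 * s1 - L2 * s12, -L2 * s12],
        [ L1 * c1 + L2 * c12,  L2 * c12],
    ])

def ik_newton_mp(target, q0, tol=1e-8, max_iter=100):
    """Newton iteration with the Moore-Penrose pseudoinverse:
    q <- q + pinv(J(q)) @ (target - fk(q))."""
    q = np.array(q0, dtype=float)
    for _ in range(max_iter):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        q = q + np.linalg.pinv(jacobian(q)) @ err
    return q

# Solve for a reachable target from a nearby initial guess.
q = ik_newton_mp(np.array([1.2, 0.8]), q0=[0.3, 0.5])
print(np.linalg.norm(fk(q) - np.array([1.2, 0.8])))  # small residual
```

Using the pseudoinverse rather than a plain inverse keeps the update defined for redundant or near-singular configurations, which is the usual motivation for this formulation.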
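The trajectory-smoothing step can be approximated in SciPy with a smoothing cubic B-spline, the non-rational special case of NURBS (SciPy does not expose full rational NURBS fitting). The jittery waypoints, noise level, and smoothing factor `s` below are illustrative assumptions.

```python
import numpy as np
from scipy import interpolate

# Synthetic jittery 2-D waypoint path standing in for the hopping
# trajectory produced by a learned policy (values are assumptions).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
nominal = np.stack([t, np.sin(2 * np.pi * t)])
jittery = nominal + rng.normal(scale=0.02, size=nominal.shape)

# Fit a smoothing cubic B-spline through the noisy waypoints;
# s trades fidelity to the waypoints against smoothness.
tck, u = interpolate.splprep(jittery, k=3, s=0.05)

# Resample the smooth trajectory densely for execution.
u_fine = np.linspace(0.0, 1.0, 200)
smooth = np.array(interpolate.splev(u_fine, tck))
print(smooth.shape)  # (2, 200)
```

Increasing `s` yields a smoother but less faithful path; `s=0` would interpolate every jittery waypoint exactly, defeating the purpose of the smoothing step.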