ACERAC: Efficient Reinforcement Learning in Fine Time Discretization.

Author Information

Lyskawa Jakub, Wawrzynski Pawel

Publication Information

IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2719-2731. doi: 10.1109/TNNLS.2022.3190973. Epub 2024 Feb 5.

Abstract

One of the main goals of reinforcement learning (RL) is to provide a way for physical machines to learn optimal behavior instead of being programmed. However, effective control of such machines usually requires fine time discretization. The most common RL methods apply independent random elements to each action, which is not suitable in that setting: it causes the controlled system to jerk and does not ensure sufficient exploration, since a single action is not long enough to create a significant experience that could be translated into policy improvement. In our view, these are the main obstacles preventing the application of RL in contemporary control systems. To address these pitfalls, in this article we introduce an RL framework and adequate analytical tools for actions that may be stochastically dependent in subsequent time instances. We also introduce an RL algorithm that approximately optimizes a policy producing such actions. It applies experience replay (ER) to adjust the likelihood of sequences of previous actions so as to optimize the expected n-step returns that the policy yields. The efficiency of this algorithm is verified against four other RL methods [continuous deep advantage updating (CDAU), proximal policy optimization (PPO), soft actor-critic (SAC), and actor-critic with ER (ACER)] on four simulated learning control problems (Ant, HalfCheetah, Hopper, and Walker2D) under diverse time discretizations. The algorithm introduced here outperforms the competitors in most cases considered.
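
The two mechanisms the abstract highlights, exploration noise that is correlated across consecutive time steps and n-step returns computed from replayed experience, can be illustrated with a short sketch. The AR(1) noise process and the names used below (AutocorrelatedNoise, alpha, sigma) are illustrative assumptions rather than the paper's exact construction or notation.

```python
import numpy as np

# Minimal sketch, assuming an AR(1)-style autocorrelated Gaussian noise process.
# This illustrates "actions that are stochastically dependent in subsequent time
# instances"; it is not the exact formulation used in the ACERAC paper.

class AutocorrelatedNoise:
    """Gaussian noise whose consecutive samples are correlated.

    With alpha close to 1 the noise changes slowly, so the perturbed control
    signal stays smooth even at fine time discretization.
    """

    def __init__(self, action_dim: int, alpha: float = 0.9, sigma: float = 0.2):
        self.alpha = alpha
        self.sigma = sigma
        self.prev = np.zeros(action_dim)

    def sample(self) -> np.ndarray:
        # xi_t = alpha * xi_{t-1} + sqrt(1 - alpha^2) * eps_t keeps the
        # stationary standard deviation equal to sigma.
        eps = np.random.normal(0.0, self.sigma, size=self.prev.shape)
        self.prev = self.alpha * self.prev + np.sqrt(1.0 - self.alpha ** 2) * eps
        return self.prev


def n_step_return(rewards, bootstrap_value, gamma: float = 0.99) -> float:
    """Discounted n-step return: sum_k gamma^k * r_{t+k} + gamma^n * V(s_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    noise = AutocorrelatedNoise(action_dim=6, alpha=0.95, sigma=0.1)
    deterministic_action = np.zeros(6)
    for _ in range(5):
        # Consecutive exploratory actions differ only slightly from each other.
        action = deterministic_action + noise.sample()
    print(n_step_return(rewards=[1.0, 0.5, 0.25], bootstrap_value=2.0))
```

With alpha near 1 the perturbation varies slowly, so the exploratory control signal stays smooth even when the time step is very short, which is the property that keeps exploration effective at fine time discretization.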

