Shang Jinghuan, Li Xiang, Kahatapitiya Kumara, Lee Yu-Cheol, Ryoo Michael S
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12862-12877. doi: 10.1109/TPAMI.2022.3204708. Epub 2023 Oct 3.
Reinforcement Learning (RL) can be considered a sequence modeling task, in which an agent employs a sequence of past state-action-reward experiences to predict a sequence of future actions. In this work, we propose the State-Action-Reward Transformer (StARformer), a Transformer architecture for robot learning with image inputs, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. StARformer first extracts StAR-representations by self-attending over patches of image states, action tokens, and reward tokens within a short temporal window. These StAR-representations are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experimental results show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, under both offline-RL and imitation learning settings. We find that models benefit from our combination of patch-wise and convolutional image embeddings. StARformer also handles longer input sequences better than the baseline method. Finally, we demonstrate how StARformer can be successfully applied to a real-world robot imitation learning setting via a human-following task.
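The two-stage design described in the abstract (local self-attention over per-step patch/action/reward tokens, then global self-attention over fused StAR tokens interleaved with convolutional state features) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: all layer names, sizes, the mean-pooling of local tokens, and the simple interleaving scheme are assumptions made for clarity.

```python
# Illustrative sketch of a StARformer-style block (NOT the official code).
# Assumptions: scalar actions/rewards, one StAR token per step via mean
# pooling, and a single-layer local + global attention stack.
import torch
import torch.nn as nn


class StARBlockSketch(nn.Module):
    def __init__(self, img_size=32, patch=8, d=64, n_heads=4):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(patch * patch * 3, d)  # image-state patch tokens
        self.act_embed = nn.Linear(1, d)                    # action token
        self.rew_embed = nn.Linear(1, d)                    # reward token
        # Local self-attention within one step: [patches, action, reward]
        self.local_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # Convolutional encoder for the "pure image state representation"
        self.conv = nn.Sequential(
            nn.Conv2d(3, d, kernel_size=img_size // 4, stride=img_size // 4),
            nn.AdaptiveAvgPool2d(1),
        )
        # Global self-attention over the whole (StAR + conv) token sequence
        self.global_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, states, actions, rewards):
        # states: (B, T, 3, H, W); actions, rewards: (B, T, 1)
        B, T, C, H, W = states.shape
        p = self.patch
        # Split each image state into non-overlapping patches and embed them.
        patches = states.unfold(3, p, p).unfold(4, p, p)       # (B,T,C,h,w,p,p)
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, -1, C * p * p)
        tok = self.patch_embed(patches)                        # (B, T, N, d)
        a = self.act_embed(actions).unsqueeze(2)               # (B, T, 1, d)
        r = self.rew_embed(rewards).unsqueeze(2)               # (B, T, 1, d)
        # Step 1: local attention inside a single timestep's S-A-R group.
        local = torch.cat([tok, a, r], dim=2).flatten(0, 1)    # (B*T, N+2, d)
        fused, _ = self.local_attn(local, local, local)
        star = fused.mean(dim=1).view(B, T, -1)                # one StAR token per step
        # Step 2: convolutional state features, interleaved with StAR tokens.
        conv = self.conv(states.flatten(0, 1)).flatten(1).view(B, T, -1)
        seq = torch.stack([star, conv], dim=2).flatten(1, 2)   # (B, 2T, d)
        out, _ = self.global_attn(seq, seq, seq)
        return out                                             # (B, 2T, d)
```

An action head (e.g. a linear layer on the per-step output tokens) would then predict the next action; the key idea illustrated is that short-term S-A-R fusion happens before, and separately from, long-range sequence attention.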