Xu Yifei, Xie Jianwen, Zhao Tianyang, Baker Chris, Zhao Yibiao, Wu Ying Nian
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):10563-10577. doi: 10.1109/TNNLS.2022.3168795. Epub 2023 Nov 30.
The continuous inverse optimal control problem (over a finite time horizon) is to learn an unknown cost function over a sequence of continuous control variables from expert demonstrations. In this article, we study this fundamental problem in the framework of energy-based models (EBMs), where the observed expert trajectories are assumed to be random samples from a probability density function defined as the exponential of the negative cost function, up to a normalizing constant. The parameters of the cost function are learned by maximum likelihood via an "analysis by synthesis" scheme, which iterates two steps: 1) synthesis step: sample synthesized trajectories from the current probability density using Langevin dynamics, with gradients computed by backpropagation through time; and 2) analysis step: update the model parameters based on the statistical difference between the synthesized and observed trajectories. Because an efficient optimization algorithm is usually available for an optimal control problem, we also consider a convenient approximation of this learning method in which the sampling in the synthesis step is replaced by optimization. Moreover, to make the sampling or optimization more efficient, we propose to train the EBM simultaneously with a top-down trajectory generator via cooperative learning, where the trajectory generator is used to quickly initialize the synthesis step of the EBM. We demonstrate the proposed methods on autonomous driving tasks and show that they can learn suitable cost functions for optimal control.
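For reference, the quantities named in the abstract can be written out as follows. The notation is an assumption made here for illustration (u for a control sequence, C_θ for the cost function, u_i for the n observed demonstrations, δ for the Langevin step size); the abstract itself does not fix any symbols. The first line is the energy-based density, the second is the standard maximum-likelihood gradient for such a model (the expectation is what the synthesized trajectories approximate), and the third is one Langevin update on the controls.

```latex
% Assumed notation: u = control sequence, C_\theta = cost, u_i = observed demonstrations
p_\theta(u) = \frac{1}{Z(\theta)} \exp\{-C_\theta(u)\},
\qquad Z(\theta) = \int \exp\{-C_\theta(u)\}\, du

\nabla_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(u_i)
  = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta C_\theta(u)\right]
  - \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta C_\theta(u_i)

u^{(k+1)} = u^{(k)} - \frac{\delta^2}{2}\, \nabla_u C_\theta\big(u^{(k)}\big) + \delta\, \epsilon_k,
\qquad \epsilon_k \sim \mathcal{N}(0, I)
```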
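The "analysis by synthesis" loop described in the abstract can be illustrated with a minimal NumPy sketch. Everything in it is an assumption chosen for illustration and is not taken from the article: the toy dynamics x_{t+1} = x_t + u_t, the two hand-picked cost features, the constant `TARGET`, and the function names `rollout`, `features`, `cost`, `grad_cost_u`, `synthesize`, and `fit`. The gradient with respect to the controls is written analytically here as a stand-in for backpropagation through time, and the parameter clipping in `fit` is only there to keep the toy example numerically stable.

```python
import numpy as np

TARGET = 1.0  # hypothetical goal state for the toy tracking task


def rollout(u):
    """States of the toy dynamics x_{t+1} = x_t + u_t with x_0 = 0; u has shape (n, T)."""
    return np.cumsum(u, axis=1)


def features(u):
    """Per-trajectory cost features, shape (n, 2): control effort and tracking error."""
    x = rollout(u)
    return np.stack([np.sum(u**2, axis=1), np.sum((x - TARGET) ** 2, axis=1)], axis=1)


def cost(u, theta):
    """Linear-in-parameters cost C_theta(u) = theta . features(u), shape (n,)."""
    return features(u) @ theta


def grad_cost_u(u, theta):
    """Analytic gradient of the cost w.r.t. the controls (stand-in for BPTT)."""
    err = rollout(u) - TARGET
    # u_s influences every later state, so sum the errors from step s onward.
    tail = np.cumsum(err[:, ::-1], axis=1)[:, ::-1]
    return 2.0 * theta[0] * u + 2.0 * theta[1] * tail


def synthesize(u, theta, rng, n_steps=30, step=0.05, noise_scale=1.0):
    """Synthesis step: Langevin updates on the control sequences.

    noise_scale=0.0 reduces this to plain gradient descent, i.e. the
    "replace sampling by optimization" approximation.
    """
    for _ in range(n_steps):
        noise = noise_scale * rng.standard_normal(u.shape)
        u = u - 0.5 * step**2 * grad_cost_u(u, theta) + step * noise
    return u


def fit(u_obs, n_iters=100, lr=0.01, seed=0):
    """Alternate synthesis and analysis steps to estimate the cost weights."""
    rng = np.random.default_rng(seed)
    theta = np.ones(2)
    for _ in range(n_iters):
        # Noise initialization; a learned trajectory generator could supply this instead.
        u_init = 0.1 * rng.standard_normal(u_obs.shape)
        u_syn = synthesize(u_init, theta, rng)
        # Analysis step: likelihood gradient = synthesized minus observed feature means.
        grad = features(u_syn).mean(axis=0) - features(u_obs).mean(axis=0)
        # Clipping is an assumption that keeps the toy cost convex and the steps stable.
        theta = np.clip(theta + lr * grad, 1e-3, 10.0)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_theta = np.array([2.0, 0.5])  # hypothetical "expert" cost weights
    demos = synthesize(np.zeros((64, 10)), true_theta, rng, n_steps=500)
    print("learned theta:", fit(demos))
```

Calling `synthesize(..., noise_scale=0.0)` gives the optimization variant, with gradient descent standing in for whichever optimal control solver would be used in practice. The cooperative-learning idea would replace the noise initialization `u_init` with the output of a trained generator; that part is not sketched here.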