Li Yanjie, Yin Baoqun, Xi Hongsheng
Department of Automation, University of Science and Technology of China, Hefei 230026, China.
IEEE Trans Syst Man Cybern B Cybern. 2008 Dec;38(6):1645-51. doi: 10.1109/TSMCB.2008.927711.
The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be obtained directly for Markov decision processes (MDPs). In this correspondence, sensitivity-based optimization is extended to average-reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas for POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without resorting to the discounted-reward formulation. We also propose a policy-iteration algorithm that obtains a nearly optimal finite-state-controller policy.
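As background for the sensitivity-based approach the abstract refers to, the two standard average-reward MDP formulas it builds on can be stated as follows; this is the conventional performance-potential notation and is given here only as a sketch of the underlying framework, not as the paper's POMDP formulas themselves:

\[
\eta' - \eta \;=\; \pi'\bigl[(f' + P' g) - (f + P g)\bigr],
\qquad
\left.\frac{d\eta_\delta}{d\delta}\right|_{\delta = 0} \;=\; \pi\bigl[\Delta P\, g + \Delta f\bigr],
\]

where P and P' are the transition matrices of two policies, f and f' their reward vectors, \pi and \pi' the corresponding stationary distributions, \eta = \pi f the average reward, the performance potential g solves the Poisson equation (I - P + e\pi)g = f, and the perturbed system is P_\delta = P + \delta\,\Delta P, f_\delta = f + \delta\,\Delta f. The correspondence derives analogous difference and derivative formulas for POMDPs controlled by finite-state controllers.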