Liao Peng, Qi Zhengling, Wan Runzhe, Klasnja Predrag, Murphy Susan A
Harvard University.
George Washington University.
Ann Stat. 2022 Dec;50(6):3364-3387. doi: 10.1214/22-aos2231. Epub 2022 Dec 21.
We consider the batch (offline) policy learning problem in an infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator of the average reward and show that it achieves semiparametric efficiency. We further develop an optimization algorithm to compute the optimal policy within a parameterized stochastic policy class. The performance of the estimated policy is measured by the gap between the optimal average reward attainable in the policy class and the average reward of the estimated policy, for which we establish a finite-sample regret guarantee. The method is illustrated by simulation studies and by an analysis of a mobile health study promoting physical activity.
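To make the estimator concrete: in the average-reward setting, the relative value function Q and the average reward η satisfy the Bellman equation Q(s, a) + η = r(s, a) + E[Q(S', π)], and a doubly robust estimator corrects an initial plug-in estimate of η with a weighted Bellman residual, the weights being estimated stationary density ratios. The following is a minimal sketch, not the authors' implementation; the function name and the assumption that the nuisances (`omega`, `q_sa`, `q_next_pi`, `eta_init`) have already been estimated elsewhere are illustrative.

```python
import numpy as np

def dr_average_reward(rewards, omega, q_sa, q_next_pi, eta_init):
    """Doubly robust one-step correction for the long-run average reward.

    rewards   : observed rewards R_i
    omega     : estimated stationary density ratios w(S_i, A_i)
    q_sa      : estimated relative value function Q(S_i, A_i)
    q_next_pi : estimated E_{a ~ pi}[Q(S_{i+1}, a)] at the next state
    eta_init  : initial plug-in estimate of the average reward
    """
    # Average-reward Bellman residual: R - eta + Q(S', pi) - Q(S, A).
    # The correction term has mean zero if either the ratios or the
    # Q-function are correctly specified (the "doubly robust" property).
    residual = rewards - eta_init + q_next_pi - q_sa
    return eta_init + np.mean(omega * residual)

# Sanity check: with unit ratios and a constant Q-function, the
# estimator reduces to the sample mean of the rewards.
rng = np.random.default_rng(0)
r = rng.normal(1.0, 0.5, size=1000)
eta = dr_average_reward(r, np.ones_like(r), np.zeros_like(r),
                        np.zeros_like(r), 0.0)
```

In practice the nuisance estimates would be fit on held-out data (cross-fitting) so that the correction term retains its mean-zero property, which is what drives the semiparametric efficiency claim.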