Liao Peng, Qi Zhengling, Wan Runzhe, Klasnja Predrag, Murphy Susan A
Harvard University.
George Washington University.
Ann Stat. 2022 Dec;50(6):3364-3387. doi: 10.1214/22-aos2231. Epub 2022 Dec 21.
We consider the batch (offline) policy learning problem in an infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator of the average reward and show that it achieves semiparametric efficiency. We further develop an optimization algorithm to compute the optimal policy within a parameterized stochastic policy class. The performance of the estimated policy is measured by the gap between the optimal average reward attainable in the policy class and the average reward of the estimated policy, for which we establish a finite-sample regret guarantee. The method is illustrated by simulation studies and by an analysis of a mobile health study promoting physical activity.
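To make the estimator concrete: in the average-reward setting, the relative value function Q and the average reward η satisfy the Bellman equation Q(s, a) + η = r(s, a) + E[Q(S', π)], and a doubly robust estimator corrects an initial plug-in estimate of η with a weighted Bellman residual, the weights being estimated stationary density ratios. The following is a minimal sketch, not the authors' implementation; the function name and the assumption that the nuisances (`omega`, `q_sa`, `q_next_pi`, `eta_init`) have already been estimated elsewhere are illustrative.

```python
import numpy as np

def dr_average_reward(rewards, omega, q_sa, q_next_pi, eta_init):
    """Doubly robust one-step correction for the long-run average reward.

    rewards   : observed rewards R_i
    omega     : estimated stationary density ratios w(S_i, A_i)
    q_sa      : estimated relative value function Q(S_i, A_i)
    q_next_pi : estimated E_{a ~ pi}[Q(S_{i+1}, a)] at the next state
    eta_init  : initial plug-in estimate of the average reward
    """
    # Average-reward Bellman residual: R - eta + Q(S', pi) - Q(S, A).
    # The correction term has mean zero if either the ratios or the
    # Q-function are correctly specified (the "doubly robust" property).
    residual = rewards - eta_init + q_next_pi - q_sa
    return eta_init + np.mean(omega * residual)

# Sanity check: with unit ratios and a constant Q-function, the
# estimator reduces to the sample mean of the rewards.
rng = np.random.default_rng(0)
r = rng.normal(1.0, 0.5, size=1000)
eta = dr_average_reward(r, np.ones_like(r), np.zeros_like(r),
                        np.zeros_like(r), 0.0)
```

In practice the nuisance estimates would be fit on held-out data (cross-fitting) so that the correction term retains its mean-zero property, which is what drives the semiparametric efficiency claim.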