Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

Affiliation

Tokyo Institute of Technology, O-okayama, Meguro-ku, Tokyo 152-8552, Japan.

Publication information

Neural Comput. 2011 Nov;23(11):2798-832. doi: 10.1162/NECO_a_00199. Epub 2011 Aug 18.

Abstract

Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples to obtain a stable policy update estimator, which is prohibitive when sampling is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)

Similar articles

Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

Neural Comput. 2011 Nov;23(11):2798-832. doi: 10.1162/NECO_a_00199. Epub 2011 Aug 18.

Efficient exploration through active learning for value function approximation in reinforcement learning.

Neural Netw. 2010 Jun;23(5):639-48. doi: 10.1016/j.neunet.2009.12.010. Epub 2010 Jan 11.

Efficient sample reuse in policy gradients with parameter-based exploration.

Neural Comput. 2013 Jun;25(6):1512-47. doi: 10.1162/NECO_a_00452. Epub 2013 Mar 21.

Adaptive importance sampling for value function approximation in off-policy reinforcement learning.

Neural Netw. 2009 Dec;22(10):1399-410. doi: 10.1016/j.neunet.2009.01.002. Epub 2009 Jan 23.

Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning.

Neural Comput. 2010 Feb;22(2):342-76. doi: 10.1162/neco.2009.12-08-922.

Model-based reinforcement learning with dimension reduction.

Neural Netw. 2016 Dec;84:1-16. doi: 10.1016/j.neunet.2016.08.005. Epub 2016 Aug 24.

Autonomous reinforcement learning with experience replay.

Neural Netw. 2013 May;41:156-67. doi: 10.1016/j.neunet.2012.11.007. Epub 2012 Nov 29.

Posterior weighted reinforcement learning with state uncertainty.

Neural Comput. 2010 May;22(5):1149-79. doi: 10.1162/neco.2010.01-09-948.

Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback.

IEEE Trans Neural Netw. 2008 Feb;19(2):230-44. doi: 10.1109/TNN.2007.905839.

Dimensional reduction for reward-based learning.

Network. 2006 Sep;17(3):235-52. doi: 10.1080/09548980600773215.

Cited by

Adaptive Baseline Enhances EM-Based Policy Search: Validation in a View-Based Positioning Task of a Smartphone Balancer.

Front Neurorobot. 2017 Jan 23;11:1. doi: 10.3389/fnbot.2017.00001. eCollection 2017.
