
Reward-weighted regression with sample reuse for direct policy search in reinforcement learning.

Affiliations

Tokyo Institute of Technology, O-okayama, Meguro-ku, Tokyo 152-8552, Japan.

Publication

Neural Comput. 2011 Nov;23(11):2798-832. doi: 10.1162/NECO_a_00199. Epub 2011 Aug 18.

Abstract

Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples to obtain a stable policy update estimator, which is prohibitive when sampling is costly. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R3), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)
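The core idea can be illustrated with a minimal sketch (an assumption-laden toy version, not the authors' exact R3 algorithm): reward-weighted regression fits the policy by weighted least squares, and importance weights allow samples drawn under an earlier behavior policy to be reused when evaluating the current one. All function names, the one-dimensional linear-Gaussian policy, and the reward model below are illustrative assumptions.

```python
import numpy as np

def rwr_update(states, actions, rewards, theta, behavior_theta, sigma=0.5):
    """One reward-weighted regression step for a 1-D linear-Gaussian policy
    a ~ N(theta * s, sigma^2), reusing samples drawn under behavior_theta
    via importance weighting (a sketch of the idea, not the paper's R3)."""
    def log_lik(th):
        return -0.5 * ((actions - th * states) / sigma) ** 2
    # Importance weights correct for the mismatch between the current
    # policy and the behavior policy that generated the samples.
    iw = np.exp(log_lik(theta) - log_lik(behavior_theta))
    w = rewards * iw
    # M-step: weighted least squares, argmin_th sum_i w_i * (a_i - th*s_i)^2.
    return float(np.sum(w * states * actions) / np.sum(w * states ** 2))

# Illustrative use: behavior policy with theta = 0; the reward
# exp(-(a - s)^2) is highest when a ≈ s, so the update moves theta
# toward the optimum at theta = 1.
rng = np.random.default_rng(0)
s = rng.normal(size=1000)
a = 0.0 * s + 0.5 * rng.normal(size=1000)  # actions from behavior policy
r = np.exp(-(a - s) ** 2)
theta_new = rwr_update(s, a, r, theta=0.0, behavior_theta=0.0)
```

Because the reward weights pull the fit toward actions that happened to land near each state, the updated `theta` moves partway from 0 toward 1 in a single step; iterating (with the importance weights correcting for the increasingly stale samples) continues the improvement.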

