IEEE Trans Cybern. 2014 Dec;44(12):2613-25. doi: 10.1109/TCYB.2014.2311578. Epub 2014 Apr 25.
To handle sequential decision problems with large or continuous state spaces, feature representation and function approximation have been major research topics in reinforcement learning (RL). This paper presents a clustering-based graph Laplacian framework for feature representation and value function approximation (VFA) in RL. Using clustering techniques, namely K-means or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA are then generated automatically from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximate policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an effective set of basis functions, and learning control performance is improved across a variety of parameter settings.
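The pipeline the abstract describes (subsample states with K-means, build a graph Laplacian over the cluster centers, take its smoothest eigenvectors as basis functions for VFA) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the kernel width `sigma`, the number of clusters, the Lloyd's-algorithm K-means, and the fully connected Gaussian similarity graph are all assumptions chosen for brevity.

```python
import numpy as np

def laplacian_basis(samples, n_clusters=20, n_basis=5, sigma=1.0, seed=0):
    """Sketch of clustering-based graph-Laplacian basis construction.

    1. Subsample the state space via K-means: cluster centers become graph nodes.
    2. Connect the nodes with a Gaussian-weighted similarity graph.
    3. Use the smoothest eigenvectors of the normalized Laplacian as basis
       functions for value function approximation.
    """
    rng = np.random.default_rng(seed)

    # --- 1. Plain Lloyd's-algorithm K-means over the sampled states ---
    centers = samples[rng.choice(len(samples), n_clusters, replace=False)].copy()
    for _ in range(50):
        # assign each sample to its nearest center
        labels = np.argmin(((samples[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            pts = samples[labels == k]
            if len(pts):  # leave empty clusters where they are
                centers[k] = pts.mean(axis=0)

    # --- 2. Gaussian similarity graph over the cluster centers ---
    d2 = ((centers[:, None] - centers[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)

    # --- 3. Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2} ---
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n_clusters) - D_inv_sqrt @ W @ D_inv_sqrt

    # Eigenvectors with the smallest eigenvalues vary most smoothly on the
    # graph; these serve as the automatically generated basis functions.
    eigvals, eigvecs = np.linalg.eigh(L)
    return centers, eigvecs[:, :n_basis]
```

A new state would then be featurized by interpolating these basis values from nearby cluster centers (e.g. nearest-neighbor or kernel-weighted), after which RPI proceeds as usual with the resulting feature vectors.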