Liu Jiamin, Lian Heng
IEEE Trans Neural Netw Learn Syst. 2025 Jun;36(6):10371-10380. doi: 10.1109/TNNLS.2024.3453036.
We investigate the decentralized nonparametric policy evaluation problem in reinforcement learning (RL), focusing on scenarios where multiple agents collaborate to learn the state-value function from sampled state transitions and privately observed rewards. Our approach centers on a regression-based multistage iteration technique that employs infinite-dimensional gradient descent (GD) in a reproducing kernel Hilbert space (RKHS). To reduce computation and communication costs, we use the Nyström approximation to project this space onto a finite-dimensional one. We establish statistical error bounds describing the convergence of the value function estimate, the first such analysis in a fully decentralized nonparametric framework. Finally, we compare the regression-based method with the kernel temporal difference (TD) method in numerical studies.
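To make the two ingredients named in the abstract concrete, below is a minimal single-machine sketch: a Nyström feature map that projects the RKHS onto a finite-dimensional space, and a regression-based multistage iteration that fits V(s) to frozen targets r + γV̂(s') by gradient descent. It assumes a Gaussian RBF kernel and batch data from a single agent; the decentralized consensus communication across agents is omitted, and all function names and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_features(X, landmarks, gamma=1.0, jitter=1e-8):
    """Finite-dimensional Nystrom feature map phi(x) = K_mm^{-1/2} k_m(x),
    so that phi(x)^T phi(y) approximates the kernel k(x, y)."""
    K_mm = rbf_kernel(landmarks, landmarks, gamma)
    # Symmetric inverse square root of K_mm via eigendecomposition.
    w, U = np.linalg.eigh(K_mm + jitter * np.eye(len(landmarks)))
    K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(w, jitter))) @ U.T
    return rbf_kernel(X, landmarks, gamma) @ K_inv_sqrt

def evaluate_policy(S, R, S_next, landmarks, disc=0.9,
                    n_stages=50, gd_steps=200, lr=0.5, kernel_gamma=1.0):
    """Multistage regression for policy evaluation: at each stage, freeze
    targets r + disc * V_hat(s') under the current estimate, then regress
    V(s) onto them by GD in the Nystrom feature space."""
    Phi = nystrom_features(S, landmarks, kernel_gamma)
    Phi_next = nystrom_features(S_next, landmarks, kernel_gamma)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_stages):
        targets = R + disc * (Phi_next @ theta)  # frozen for this stage
        for _ in range(gd_steps):
            grad = Phi.T @ (Phi @ theta - targets) / len(S)
            theta -= lr * grad
    return lambda X: nystrom_features(X, landmarks, kernel_gamma) @ theta

if __name__ == "__main__":
    # Toy usage: 1-D states, noisy rewards, 20 landmark states.
    rng = np.random.default_rng(0)
    S = rng.random((500, 1))
    S_next = np.clip(S + 0.1 * rng.standard_normal((500, 1)), 0.0, 1.0)
    R = np.cos(2 * np.pi * S[:, 0]) + 0.1 * rng.standard_normal(500)
    V_hat = evaluate_policy(S, R, S_next, landmarks=S[:20])
    print(V_hat(np.linspace(0, 1, 5).reshape(-1, 1)))
```

With an RBF kernel, k(x, x) = 1, so the features are bounded and a step size around 0.5 keeps the inner GD loop stable; in the decentralized setting each agent would run these local GD steps on its private rewards and average parameters with its neighbors between updates.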