Oommen B J, Agache M
Sch. of Comput. Sci., Carleton Univ., Ottawa, Ont.
IEEE Trans Syst Man Cybern B Cybern. 2001;31(3):277-87. doi: 10.1109/3477.931507.
A learning automaton (LA) is an automaton that interacts with a random environment, with the goal of learning the optimal action from its acquired experience. Many learning automata (LAs) have been proposed, and the class of estimator algorithms is among the fastest. Thathachar and Sastry, through the pursuit algorithm, introduced the concept of learning algorithms that pursue the current optimal action, following a reward-penalty learning philosophy. Later, Oommen and Lanctot extended the pursuit algorithm into the discretized world by presenting the discretized pursuit algorithm, based on a reward-inaction learning philosophy. In this paper, we argue that the reward-penalty and reward-inaction learning paradigms, in conjunction with the continuous and discrete models of computation, lead to four versions of pursuit learning automata. We contend that a scheme merging the pursuit concept with the most recent response of the environment permits the algorithm to utilize the LA's long-term and short-term perspectives of the environment. We present all four resultant pursuit algorithms, prove the ε-optimality of the newly introduced algorithms, and present a quantitative comparison between them.
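To make the pursuit idea concrete, the following is a minimal Python sketch of one of the four variants the abstract refers to: a continuous pursuit automaton with a reward-inaction update. The environment, the learning rate lam, the reward probabilities, and the function name pursuit_ri are illustrative assumptions, not details taken from the paper; in particular, the paper's algorithms also specify an initialization phase for the reward estimates that is omitted here.

```python
# Sketch of a continuous pursuit LA with a reward-inaction update (assumed
# parameters; not the authors' exact formulation).
import random

def pursuit_ri(reward_probs, lam=0.01, steps=20000, seed=0):
    rng = random.Random(seed)
    r = len(reward_probs)
    p = [1.0 / r] * r        # action probability vector (short-term behaviour)
    counts = [0] * r         # times each action was chosen
    rewards = [0] * r        # rewards received per action
    d_hat = [0.0] * r        # running reward estimates (long-term perspective)

    for _ in range(steps):
        # sample an action according to the current probability vector
        a = rng.choices(range(r), weights=p)[0]
        beta = 1 if rng.random() < reward_probs[a] else 0  # 1 = reward, 0 = penalty

        # update the reward estimate of the chosen action
        counts[a] += 1
        rewards[a] += beta
        d_hat[a] = rewards[a] / counts[a]

        # reward-inaction: adjust p only on rewarded steps,
        # pursuing the action with the highest current estimate
        if beta == 1:
            m = max(range(r), key=lambda i: d_hat[i])
            p = [(1 - lam) * pi + (lam if i == m else 0.0)
                 for i, pi in enumerate(p)]
    return p

if __name__ == "__main__":
    # hypothetical two-action environment with reward probabilities 0.8 and 0.6
    print(pursuit_ri([0.8, 0.6]))
```

The reward-penalty counterpart would also move p toward the currently estimated best action on penalized steps, and the discretized versions replace the continuous update by steps that are integer multiples of a fixed resolution.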