Tamatsukuri Akihiro, Takahashi Tatsuji
Graduate School of Advanced Science and Engineering, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama 350-0394, Japan.
School of Science and Engineering, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama 350-0394, Japan; Dwango Artificial Intelligence Laboratory, 5-24-5 Hongo, Bunkyo, Tokyo 113-0033, Japan.
Biosystems. 2019 Jun;180:46-53. doi: 10.1016/j.biosystems.2019.02.009. Epub 2019 Feb 27.
As reinforcement learning algorithms are applied to increasingly complicated and realistic tasks, solving such problems within a practical time frame is becoming ever more difficult. Hence, we focus on a satisficing strategy that looks for an action whose value is above the aspiration level (analogous to the break-even point), rather than for the optimal action. In this paper, we introduce a simple mathematical model called risk-sensitive satisficing (RS) that implements a satisficing strategy by integrating risk-averse and risk-prone attitudes under the greedy policy. We apply the proposed model to K-armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. The first is that RS is guaranteed to find an action whose value is above the aspiration level. The second is that the regret (expected loss) of RS is bounded above by a finite value, provided that the aspiration level is set to an "optimal level" so that satisficing implies optimizing. We confirm these results through numerical simulations and compare the performance of RS with that of other representative algorithms for the K-armed bandit problem.
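To make the satisficing idea concrete, the following is a minimal sketch of an aspiration-based greedy bandit policy in the spirit of RS. The specific value form n_i * (Q_i - aspiration), the arm means, and the aspiration setting are illustrative assumptions, not a verbatim reproduction of the paper's equations.

```python
# Hedged sketch of a satisficing bandit policy in the spirit of the
# risk-sensitive satisficing (RS) model described in the abstract.
# ASSUMPTION: the per-arm value RS_i = n_i * (Q_i - aspiration) is used
# here only for illustration; consult the paper for the exact model.
import random


def rs_bandit(arm_means, aspiration, steps=10000, seed=0):
    """Greedy play on the per-arm value n_i * (Q_i - aspiration)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k          # n_i: number of times arm i was pulled
    estimates = [0.0] * k     # Q_i: sample-mean reward of arm i
    total_reward = 0.0

    for _ in range(steps):
        # Above the aspiration level, frequently tried arms are favoured
        # (risk-averse exploitation); below it, rarely tried arms are
        # favoured (risk-prone exploration).
        rs_values = [n * (q - aspiration) for n, q in zip(counts, estimates)]
        arm = max(range(k), key=lambda i: rs_values[i])

        # Bernoulli reward drawn from a hypothetical true arm mean.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


if __name__ == "__main__":
    # Aspiration set between the best (0.7) and second-best (0.5) arm means,
    # so that satisficing coincides with choosing the optimal arm.
    counts, estimates, total = rs_bandit([0.3, 0.5, 0.7], aspiration=0.6)
    print("pull counts:", counts)
    print("estimates:", [round(q, 3) for q in estimates])
```

With the aspiration level placed between the best and second-best expected rewards, a run of this sketch concentrates its pulls on the optimal arm, which mirrors the abstract's point that satisficing at an "optimal level" of aspiration implies optimizing.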