Ochi Kento, Kamiura Moto
Graduate School of Science and Engineering, Tokyo Denki University, Japan.
Graduate School of Science and Engineering, Tokyo Denki University, Japan; School of Science and Engineering, Tokyo Denki University, Japan; Research Institute of Electrical Communication, Tohoku University, Japan.
Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.
A multi-armed bandit problem is a search problem in which a learning agent must select the optimal arm among multiple slot machines generating random rewards. The UCB algorithm is one of the most popular methods for solving multi-armed bandit problems; it achieves logarithmic regret by balancing exploration and exploitation. Since the introduction of UCB algorithms, researchers have known empirically that optimistic value functions perform well in multi-armed bandit problems. The terms optimistic or optimism suggest that the value function is sufficiently larger than the sample mean of rewards. The original formulation of the UCB algorithm, however, focuses on optimizing regret and is not directly based on the optimism of a value function, so we need to consider why optimism yields good performance in multi-armed bandit problems. In the present article, we propose a new method, called the Overtaking method, for solving multi-armed bandit problems. The value function of the proposed method is defined as the upper bound of a confidence interval for an estimator of the expected reward: the value function asymptotically approaches the expected reward from above. If the value function is larger than the expected reward along this asymptote, then the learning agent is almost surely able to identify the optimal arm. This structure, called the sand-sifter mechanism, prevents any regrowth of the value functions of suboptimal arms, meaning that the agent plays only the current best arm at each time step. Consequently, the proposed method achieves a high accuracy rate and low regret, and some of its value functions outperform UCB algorithms. This study demonstrates, in one of the simplest frameworks, the advantage of optimism for agents in uncertain environments.
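The abstract contrasts the proposed method with the standard UCB approach, in which each arm's value is its sample mean plus an exploration bonus that shrinks as the arm is played. As a point of reference (the paper's own Overtaking method is not specified here), a minimal sketch of the classic UCB1 index on Bernoulli-reward arms might look like this; the arm means, horizon, and seed below are illustrative choices, not values from the paper:

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms; return the pull count per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # times each arm has been played
    sums = [0.0] * k   # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # UCB1 value function: sample mean + optimistic confidence bonus.
            # The bonus sqrt(2 ln t / n_i) decays as arm i accumulates plays.
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.5, 0.8], horizon=5000)
```

Because the exploration bonus decays only logarithmically, suboptimal arms are revisited infinitely often under UCB1; the sand-sifter mechanism described above is designed to avoid exactly this regrowth of suboptimal value functions.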