Sledge Isaac J, Príncipe José C
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA.
Computational NeuroEngineering Laboratory (CNEL), University of Florida, Gainesville, FL 32611, USA.
Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.
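The core idea in the abstract can be sketched as a soft-max (Boltzmann) exploration policy whose exploration parameter is annealed over the arm pulls. This is a minimal illustration under assumed details, not the authors' exact value-of-information update: here the inverse temperature grows like `c * log(t)`, so early pulls are exploration-dominant and later pulls exploit the current reward estimates.

```python
import math
import random

def annealed_softmax_bandit(means, horizon, c=2.0, seed=0):
    """Soft-max exploration for a Bernoulli bandit with a logarithmic
    cooling schedule. Hypothetical sketch of the annealing idea; the
    paper's actual criterion is the value of information."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # pulls per arm
    estimates = [0.0] * k     # running mean reward per arm
    pulls = []
    for t in range(1, horizon + 1):
        beta = c * math.log(t + 1)  # assumed "sufficiently fast" cooling
        # Soft-max policy over the current reward estimates.
        weights = [math.exp(beta * q) for q in estimates]
        z = sum(weights)
        r, arm, acc = rng.random() * z, 0, 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                arm = i
                break
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        pulls.append(arm)
    best = max(means)
    regret = best * horizon - sum(means[a] for a in pulls)
    return counts, regret

counts, regret = annealed_softmax_bandit([0.2, 0.5, 0.8], horizon=5000)
```

With a fixed cooling rate `c`, exploration decays as the estimates sharpen; the paper's result is that a sufficiently fast schedule of this kind yields regret logarithmic in the number of pulls.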