Gokcesu Kaan, Kozat Suleyman Serdar
IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5565-5580. doi: 10.1109/TNNLS.2018.2806006. Epub 2018 Mar 8.
We investigate the adversarial multiarmed bandit problem and introduce an online algorithm that asymptotically achieves the performance of the best switching bandit arm selection strategy. Our algorithms are truly online such that we do not use the game length or the number of switches of the best arm selection strategy in their constructions. Our results are guaranteed to hold in an individual sequence manner, since we have no statistical assumptions on the bandit arm losses. Our regret bounds, i.e., our performance bounds with respect to the best bandit arm selection strategy, are minimax optimal up to logarithmic terms. We achieve the minimax optimal regret with computational complexity only log-linear in the game length. Thus, our algorithms can be efficiently used in applications involving big data. Through an extensive set of experiments involving synthetic and real data, we demonstrate significant performance gains achieved by the proposed algorithm with respect to the state-of-the-art switching bandit algorithms. We also introduce a general efficiently implementable bandit arm selection framework, which can be adapted to various applications.
我们研究对抗性多臂老虎机问题,并引入一种在线算法,该算法渐近地实现了最佳切换老虎机臂选择策略的性能。我们的算法是真正在线的,以至于在其构造中不使用最佳臂选择策略的博弈长度或切换次数。由于我们对老虎机臂损失没有统计假设,我们的结果保证以个体序列的方式成立。我们的遗憾界,即相对于最佳老虎机臂选择策略的性能界,在对数项范围内是极小极大最优的。我们以仅与博弈长度成对数线性的计算复杂度实现了极小极大最优遗憾。因此,我们的算法可以有效地应用于涉及大数据的应用中。通过一系列涉及合成数据和真实数据的广泛实验,我们证明了所提出的算法相对于现有最先进的切换老虎机算法实现了显著的性能提升。我们还引入了一个通用的可有效实现的老虎机臂选择框架,该框架可适用于各种应用。