多项式时间算法，用于具有全带反馈的多臂识别。

Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.

机构信息

University of Tokyo, Bunkyo-ku, Tokyo, 113-0333, Japan, and RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo 103-0027, Japan

RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo 103-0027, Japan

出版信息

Neural Comput. 2020 Sep;32(9):1733-1773. doi: 10.1162/neco_a_01299. Epub 2020 Jul 20.

DOI:10.1162/neco_a_01299

PMID:32687769

Abstract

We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size- subset of arms (also known as a ) from given arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm or assumed each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe a reward of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of the best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time is (log ), thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than 10 super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.

摘要

我们研究了随机多臂识别问题，其中代理者从给定的个臂中顺序探索一个大小为的子集（也称为），并试图识别最佳超级臂。到目前为止，大多数工作都考虑了半臂问题设置，其中代理者可以观察每个拉出的臂的奖励，或者假设每个臂在每一轮都可以被查询。然而，在实际应用中，观察单个臂的奖励是昂贵的，有时甚至是不可能的。在这项研究中，我们解决了全臂问题设置，其中在每次抽取时只给出超级臂的总和的一个噪声观测值。虽然我们的问题可以被视为线性带臂中的最佳臂识别的一个实例，但基于线性带臂的简单方法在计算上是不可行的，因为超级臂的数量是指数级的。为了解决这个问题，我们首先设计了一种用于置信椭球最大化中出现的 0-1 二次规划问题的多项式时间逼近算法。基于我们的逼近算法，我们提出了一种带臂算法，其计算时间为 (log )，从而比线性带臂算法实现了指数级的加速。我们提供了一个样本复杂度上界，仍然是最坏情况下最优的。最后，我们在具有超过 10 个超级臂的大规模数据集上进行了实验，证明了我们的算法在计算时间和样本复杂度方面的优越性。

Suppr 超能文献

文献检索

文件翻译

深度研究

Suppr 超能文献

文献检索

文件翻译

深度研究

多项式时间算法，用于具有全带反馈的多臂识别。

Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.

机构信息

出版信息

相似文献

多项式时间算法，用于具有全带反馈的多臂识别。

Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.

机构信息

出版信息

相似文献