Onur Atan, Cem Tekin, Mihaela van der Schaar
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5798-5811. doi: 10.1109/TNNLS.2018.2818742. Epub 2018 Apr 12.
Multiarmed bandits (MABs) model sequential decision-making problems in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most prior work on MABs assumes that the reward distributions of the arms are independent. However, in a wide variety of decision problems, from drug dosage to dynamic pricing, the expected rewards of different arms are correlated, so that selecting one arm also provides information about the expected rewards of the other arms. We propose and analyze a class of models of such decision problems, which we call global bandits (GB). In the case in which the rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the true parameter of the problem. Hence, this policy selects suboptimal arms only finitely many times with probability one. For this case, we also obtain a bound on regret that is independent of the true parameter; this bound is sublinear, with an exponent that depends on the informativeness of the arms. We also propose a variant of the greedy policy that achieves both worst-case and parameter-dependent regret bounds. Finally, we perform experiments on dynamic pricing and show that the proposed algorithms achieve significant gains over well-known benchmarks.
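To illustrate the idea behind global bandits, the following is a minimal sketch, not the authors' exact algorithm: each arm's expected reward is a known deterministic function mu_k(theta) of one unknown parameter theta, so every pull (of any arm) is informative about theta. A greedy policy can re-estimate theta from all past observations (here by grid-based least squares) and then play the arm with the highest predicted reward. The reward curves, noise level, and grid below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known reward curves mu_k(theta), one per arm (illustrative choices):
# the expected rewards of all arms are deterministic functions of theta.
mu = [lambda t: t,               # arm 0
      lambda t: 1.0 - t,         # arm 1
      lambda t: 0.4 + 0.2 * t]   # arm 2

theta_true = 0.7                 # unknown to the learner
noise_sd = 0.1
T = 2000
grid = np.linspace(0.0, 1.0, 1001)   # candidate values of theta

obs = []            # history of (arm, observed reward) pairs
total_reward = 0.0
for t in range(T):
    if not obs:
        arm = int(rng.integers(len(mu)))   # first pull: arbitrary arm
    else:
        # Step 1: least-squares estimate of theta from ALL past pulls --
        # pulls of every arm carry information about the shared parameter.
        sse = np.zeros_like(grid)
        for k, r in obs:
            sse += (mu[k](grid) - r) ** 2
        theta_hat = float(grid[np.argmin(sse)])
        # Step 2: greedily play the arm with the highest predicted reward.
        arm = int(np.argmax([f(theta_hat) for f in mu]))
    r = mu[arm](theta_true) + rng.normal(0.0, noise_sd)
    obs.append((arm, r))
    total_reward += r

best_mean = max(f(theta_true) for f in mu)
regret = T * best_mean - total_reward
print(f"estimated theta: {theta_hat:.3f}, cumulative regret: {regret:.1f}")
```

Because every arm is informative about the single parameter, the estimate of theta keeps improving even while a suboptimal arm is played, which is the intuition behind the bounded (rather than logarithmic) regret of the greedy policy in the GB setting.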