Xie Hong, Tang Qiao, Zhu Qingsheng
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9887-9899. doi: 10.1109/TNNLS.2022.3161806. Epub 2023 Nov 30.
Upper confidence bound (UCB)-based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrap technique. Our proposed estimator mitigates the problem of specifying a heavier tail property by adaptively converging to the ground truth contextual bandit UCB (i.e., eliminating the impact of the specified heavier tail property) with theoretical guarantees on the convergence. The design and convergence analysis of the proposed estimator is technically nontrivial. The proposed estimator is generic and it can be applied to improve a variety of UCB-based contextual bandit algorithms. To demonstrate the versatility of the proposed estimator, we apply it to improve the linear reward contextual bandit UCB (LinUCB) algorithm resulting in our bootstrapping LinUCB (BootLinUCB) algorithm. We prove that the BootLinUCB has a sublinear regret. We conduct extensive experiments on both synthetic dataset and real-world dataset from Yahoo! to validate the benefits of our proposed estimator in reducing regret and the superior performance of BootLinUCB over the latest baseline.
基于上置信界(UCB)的上下文博弈算法要求人们了解奖励分布的尾部性质。不幸的是,在实际应用中,这种尾部性质通常是未知的或难以确定的。使用比真实情况更重的尾部性质会导致上下文博弈算法的学习速度缓慢,而使用更轻的尾部性质可能会导致算法发散。为了解决这个基本问题,我们基于乘数自助法技术开发了一种用于上下文博弈UCB的估计器(根据历史奖励进行评估)。我们提出的估计器通过自适应地收敛到真实的上下文博弈UCB(即消除指定的更重尾部性质的影响)来减轻指定更重尾部性质的问题,并在收敛方面有理论保证。所提出的估计器的设计和收敛分析在技术上并非易事。所提出的估计器是通用的,可应用于改进各种基于UCB的上下文博弈算法。为了证明所提出估计器的通用性,我们将其应用于改进线性奖励上下文博弈UCB(LinUCB)算法,从而得到我们的自助LinUCB(BootLinUCB)算法。我们证明BootLinUCB具有次线性遗憾。我们在合成数据集和雅虎的真实世界数据集上进行了广泛的实验,以验证我们提出的估计器在减少遗憾方面的好处以及BootLinUCB相对于最新基线的优越性能。