
Suppr 超能文献




An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.

Publication Information

IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.

DOI: 10.1109/TNNLS.2020.2995920
PMID: 32479408
Abstract

This brief studies a variation of the stochastic multiarmed bandit (MAB) problem in which the agent has a piece of a priori knowledge, namely the near-optimal mean reward (NoMR). In common MAB problems, an agent tries to find the optimal arm without knowing the optimal mean reward. In many practical applications, however, the agent can obtain an estimate of the optimal mean reward, which we define as the NoMR. For instance, in an online Web advertising system based on MAB methods, a user's near-optimal average click rate (NoMR) can be roughly estimated from his/her demographic characteristics. Exploiting the NoMR can therefore improve an algorithm's performance. First, we formalize the stochastic MAB problem in which the NoMR, lying between the suboptimal mean reward and the optimal mean reward, is known. Second, using cumulative regret as the performance metric, we show that this problem's lower bound on the cumulative regret is Ω(1/∆), where ∆ is the difference between the suboptimal mean reward and the optimal mean reward. In contrast to the conventional MAB problem, whose regret lower bound grows logarithmically, our lower bound is uniform in the learning step. Third, a novel algorithm, NoMR-BANDIT, is set forth to solve this problem; in NoMR-BANDIT, the NoMR is used to design an efficient exploration strategy. We further analyze NoMR-BANDIT's regret upper bound and conclude that it is also uniform, at O(1/∆), which matches the order of the lower bound. Consequently, NoMR-BANDIT is an optimal algorithm for this problem. To improve the method's generality, CASCADE-BANDIT, based on NoMR-BANDIT, is proposed for the setting where the NoMR is less than the suboptimal mean reward. CASCADE-BANDIT has a regret upper bound of O(∆ log n), where n is the learning step, and this order is the same as that of conventional MAB methods. Finally, extensive experimental results demonstrate that NoMR-BANDIT is more efficient than the compared bandit solutions: after sufficient iterations, NoMR-BANDIT reduced cumulative regret by 10%-80% relative to the state of the art.
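The key observation the abstract describes (a known NoMR strictly between the suboptimal and optimal mean rewards) can be illustrated with a minimal sketch. This is an assumed illustration, not the paper's actual NoMR-BANDIT procedure: once an arm's well-sampled empirical mean exceeds the NoMR threshold, that arm must be the optimal one, so exploration can stop after a bounded number of pulls per arm, which is consistent with a regret bound that does not grow with n. The names `nomr_bandit`, `pull`, and `min_pulls` are hypothetical.

```python
def nomr_bandit(pull, k, nomr, horizon, min_pulls=30):
    """Illustrative NoMR threshold rule (assumption, not the paper's algorithm).

    Explore the k arms round-robin; once an arm's empirical mean exceeds the
    known NoMR after at least `min_pulls` samples, commit to that arm forever.
    Because the NoMR sits between the suboptimal and optimal mean rewards,
    only the optimal arm's estimate can settle above the threshold.
    """
    counts = [0] * k          # pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    committed = None          # arm we have locked onto, if any
    for t in range(horizon):
        arm = committed if committed is not None else t % k
        sums[arm] += pull(arm)
        counts[arm] += 1
        if (committed is None and counts[arm] >= min_pulls
                and sums[arm] / counts[arm] > nomr):
            committed = arm
    return committed, counts


if __name__ == "__main__":
    # Demo with Bernoulli arms (stochastic, so output varies with the seed).
    import random
    rng = random.Random(0)
    means = [0.3, 0.7]   # suboptimal mean 0.3, optimal mean 0.7
    nomr = 0.5           # assumed known, strictly between the two means
    arm, counts = nomr_bandit(lambda a: rng.random() < means[a], 2, nomr, 2000)
    print("committed arm:", arm, "pull counts:", counts)
```

With deterministic rewards equal to the arm means, the rule commits to the optimal arm as soon as it has `min_pulls` samples, after which every remaining pull goes to that arm.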


Similar Articles

1. An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.
   IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.
2. Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
   Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.
3. An Online Minimax Optimal Algorithm for Adversarial Multiarmed Bandit Problem.
   IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5565-5580. doi: 10.1109/TNNLS.2018.2806006. Epub 2018 Mar 8.
4. A Thompson Sampling Algorithm With Logarithmic Regret for Unimodal Gaussian Bandit.
   IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5332-5341. doi: 10.1109/TNNLS.2023.3295360. Epub 2023 Sep 1.
5. Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.
   Neural Comput. 2020 Sep;32(9):1733-1773. doi: 10.1162/neco_a_01299. Epub 2020 Jul 20.
6. Covariance Matrix Adaptation for Multiobjective Multiarmed Bandits.
   IEEE Trans Neural Netw Learn Syst. 2019 Aug;30(8):2493-2502. doi: 10.1109/TNNLS.2018.2885123. Epub 2018 Dec 28.
7. Minimax Optimal Bandits for Heavy Tail Rewards.
   IEEE Trans Neural Netw Learn Syst. 2024 Apr;35(4):5280-5294. doi: 10.1109/TNNLS.2022.3203035. Epub 2024 Apr 4.
8. An Efficient Algorithm for Deep Stochastic Contextual Bandits.
   Proc AAAI Conf Artif Intell. 2021 Feb;35(12):11193-11201.
9. Self-Unaware Adversarial Multi-Armed Bandits With Switching Costs.
   IEEE Trans Neural Netw Learn Syst. 2023 Jun;34(6):2908-2922. doi: 10.1109/TNNLS.2021.3110194. Epub 2023 Jun 1.
10. A Multiplier Bootstrap Approach to Designing Robust Algorithms for Contextual Bandits.
    IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9887-9899. doi: 10.1109/TNNLS.2022.3161806. Epub 2023 Nov 30.