

Global Bandits

Authors

Atan Onur, Tekin Cem, van der Schaar Mihaela

Publication

IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5798-5811. doi: 10.1109/TNNLS.2018.2818742. Epub 2018 Apr 12.

DOI: 10.1109/TNNLS.2018.2818742
PMID: 29993936
Abstract

Multiarmed bandits (MABs) model sequential decision-making problems, in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most of the prior works on MAB assume that the reward distributions of each arm are independent. But in a wide variety of decision problems-from drug dosage to dynamic pricing-the expected rewards of different arms are correlated, so that selecting one arm provides information about the expected rewards of other arms as well. We propose and analyze a class of models of such decision problems, which we call global bandits (GB). In the case in which rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the single true parameter of the problem. Hence, this policy selects suboptimal arms only finitely many times with probability one. For this case, we also obtain a bound on regret that is independent of the true parameter; this bound is sublinear, with an exponent that depends on the informativeness of the arms. We also propose a variant of the greedy policy that achieves worst case and parameter-dependent regret. Finally, we perform experiments on dynamic pricing and show that the proposed algorithms achieve significant gains with respect to the well-known benchmarks.
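The core setting described above — every arm's expected reward is a known function of a single unknown parameter, so each pull is informative about all arms — can be illustrated with a minimal simulation of a greedy policy. This is a sketch under assumed arm functions and noise level, not the paper's exact algorithm or analysis: the three reward curves, the inverse maps, and the noise scale are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical global-bandit instance: each arm's expected reward is a
# known, invertible function of one unknown parameter theta (assumed forms).
theta_true = 0.7
mu = [lambda t: 2.0 * t,          # arm 0
      lambda t: 1.0 - 0.5 * t,    # arm 1
      lambda t: t ** 2]           # arm 2
inv = [lambda r: r / 2.0,         # inverse of each arm's mean function
       lambda r: 2.0 * (1.0 - r),
       lambda r: float(np.sqrt(max(r, 0.0)))]

theta_hat, n_pulls = 0.5, 0       # initial guess for theta
for _ in range(2000):
    # Greedy step: play the arm with the highest estimated mean reward.
    arm = int(np.argmax([f(theta_hat) for f in mu]))
    reward = mu[arm](theta_true) + rng.normal(0.0, 0.1)
    # Global update: map the noisy reward back through the pulled arm's
    # inverse, so every pull refines the single shared parameter estimate.
    n_pulls += 1
    theta_hat += (inv[arm](reward) - theta_hat) / n_pulls

print(f"estimated theta = {theta_hat:.3f}, last arm played = {arm}")
```

Because all pulls feed one running estimate of the parameter, the policy stops playing suboptimal arms once the estimate is accurate enough, which is the intuition behind the bounded-regret result stated in the abstract.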


Similar Articles

1. Global Bandits.
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5798-5811. doi: 10.1109/TNNLS.2018.2818742. Epub 2018 Apr 12.
2. Minimax Optimal Bandits for Heavy Tail Rewards.
IEEE Trans Neural Netw Learn Syst. 2024 Apr;35(4):5280-5294. doi: 10.1109/TNNLS.2022.3203035. Epub 2024 Apr 4.
3. Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis.
Entropy (Basel). 2024 Jul 17;26(7):606. doi: 10.3390/e26070606.
4. An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.
IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.
5. Understanding the stochastic dynamics of sequential decision-making processes: A path-integral analysis of multi-armed bandits.
Chaos. 2023 Jun 1;33(6). doi: 10.1063/5.0120076.
6. Covariance Matrix Adaptation for Multiobjective Multiarmed Bandits.
IEEE Trans Neural Netw Learn Syst. 2019 Aug;30(8):2493-2502. doi: 10.1109/TNNLS.2018.2885123. Epub 2018 Dec 28.
7. Self-Unaware Adversarial Multi-Armed Bandits With Switching Costs.
IEEE Trans Neural Netw Learn Syst. 2023 Jun;34(6):2908-2922. doi: 10.1109/TNNLS.2021.3110194. Epub 2023 Jun 1.
8. Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.
9. An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits.
Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.
10. Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.
Neural Comput. 2020 Sep;32(9):1733-1773. doi: 10.1162/neco_a_01299. Epub 2020 Jul 20.