Suppr 超能文献


Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?

Authors

Ochi Kento, Kamiura Moto

Affiliations

Graduate School of Science and Engineering, Tokyo Denki University, Japan.

Graduate School of Science and Engineering, Tokyo Denki University, Japan; School of Science and Engineering, Tokyo Denki University, Japan; Research Institute of Electrical Communication, Tohoku University, Japan.

Publication

Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.

DOI: 10.1016/j.biosystems.2015.06.009
PMID: 26166266
Abstract

A multi-armed bandit problem is a search problem in which a learning agent must select the optimal arm among multiple slot machines generating random rewards. The UCB algorithm is one of the most popular methods for solving multi-armed bandit problems. It achieves logarithmic regret by balancing exploration and exploitation. Since the introduction of UCB algorithms, researchers have known empirically that optimistic value functions perform well in multi-armed bandit problems. The terms optimistic or optimism suggest that the value function is sufficiently larger than the sample mean of the rewards. The original definition of the UCB algorithm, however, focuses on the optimization of regret and is not directly based on the optimism of a value function. We therefore need to consider why optimism yields good performance in multi-armed bandit problems. In the present article, we propose a new method, called the Overtaking method, for solving multi-armed bandit problems. The value function of the proposed method is defined as the upper bound of a confidence interval for an estimator of the expected reward: the value function asymptotically approaches the expected reward from above. If the value function is larger than the expected value along this asymptote, then the learning agent almost surely obtains the optimal arm. This structure is called the sand-sifter mechanism: the value functions of suboptimal arms do not regrow, which means the learning agent plays only the current best arm at each time step. Consequently, the proposed method achieves a high accuracy rate and low regret, and some of its value functions outperform UCB algorithms. This study illustrates the advantage of agent optimism in uncertain environments using one of the simplest frameworks.
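The Overtaking method itself is not specified in this abstract, but the baseline it is compared against is the standard UCB1 value function: sample mean plus a confidence radius that shrinks as an arm is pulled more often. A minimal sketch on a Bernoulli bandit is below; the exploration constant (2), reward distribution, and horizon are assumptions for illustration, not details from the paper.

```python
import math
import random

def ucb1(probs, horizon=10000, seed=0):
    """Run UCB1 on a Bernoulli bandit; return pull counts per arm.

    probs   -- true success probability of each arm (unknown to the agent)
    horizon -- total number of pulls
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k   # times each arm has been pulled
    sums = [0.0] * k   # cumulative reward per arm
    # Initialize by pulling each arm once.
    for a in range(k):
        counts[a] = 1
        sums[a] = float(rng.random() < probs[a])
    for t in range(k, horizon):
        # Optimistic value: sample mean + confidence radius (upper bound).
        values = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
                  for a in range(k)]
        best = max(range(k), key=lambda a: values[a])
        counts[best] += 1
        sums[best] += float(rng.random() < probs[best])
    return counts
```

Because the confidence radius of a neglected arm grows with log t, suboptimal arms are revisited occasionally; the sand-sifter mechanism described above differs precisely in that suboptimal value functions do not regrow.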


Similar Articles

1. Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems? Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.
2. Optimism in the face of uncertainty supported by a statistically-designed multi-armed bandit algorithm. Biosystems. 2017 Oct;160:25-32. doi: 10.1016/j.biosystems.2017.08.004. Epub 2017 Aug 22.
3. Amoeba-inspired Tug-of-War algorithms for exploration-exploitation dilemma in extended Bandit Problem. Biosystems. 2014 Mar;117:1-9. doi: 10.1016/j.biosystems.2013.12.007. Epub 2013 Dec 31.
4. An empirical evaluation of active inference in multi-armed bandits. Neural Netw. 2021 Dec;144:229-246. doi: 10.1016/j.neunet.2021.08.018. Epub 2021 Aug 26.
5. An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward. IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.
6. Some performance considerations when using multi-armed bandit algorithms in the presence of missing data. PLoS One. 2022 Sep 12;17(9):e0274272. doi: 10.1371/journal.pone.0274272. eCollection 2022.
7. Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function. Biosystems. 2019 Jun;180:46-53. doi: 10.1016/j.biosystems.2019.02.009. Epub 2019 Feb 27.
8. Cognitively inspired reinforcement learning architecture and its application to giant-swing motion control. Biosystems. 2014 Feb;116:1-9. doi: 10.1016/j.biosystems.2013.11.002. Epub 2013 Dec 1.
9. Risk-aware multi-armed bandit problem with application to portfolio selection. R Soc Open Sci. 2017 Nov 15;4(11):171377. doi: 10.1098/rsos.171377. eCollection 2017 Nov.
10. Decision making for large-scale multi-armed bandit problems using bias control of chaotic temporal waveforms in semiconductor lasers. Sci Rep. 2022 May 16;12(1):8073. doi: 10.1038/s41598-022-12155-y.