An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits.

Authors

Sledge Isaac J, Príncipe José C

Affiliations

Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA.

Computational NeuroEngineering Laboratory (CNEL), University of Florida, Gainesville, FL 32611, USA.

Publication information

Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.

DOI:10.3390/e20030155
PMID:33265246
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7512671/
Abstract

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.
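
The abstract describes a parameter that trades exploration against exploitation and is cooled in a simulated-annealing-like fashion. As a rough illustration only, the sketch below uses a generic soft-max (Boltzmann) policy over empirical arm means with a decaying temperature. This is an assumed stand-in, not the authors' algorithm: the function names, the Bernoulli reward model, and the cooling schedule `t0 / log(k + 1)` are all hypothetical choices, and the sketch makes no claim about achieving the regret bound stated in the paper.

```python
import math
import random


def softmax_probs(q_values, temperature):
    """Boltzmann distribution over arms; subtract the max for numerical stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def run_annealed_softmax(arm_means, n_pulls, seed=0, t0=1.0, t_min=0.01):
    """Play Bernoulli arms with a soft-max policy whose temperature cools over time.

    The schedule max(t_min, t0 / log(k + 1)) is a hypothetical choice for
    illustration, not the schedule analyzed in the paper.
    Returns (cumulative expected regret, pull counts per arm).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms          # number of pulls per arm
    q_values = [0.0] * n_arms      # running mean reward per arm
    best_mean = max(arm_means)
    regret = 0.0
    for k in range(1, n_pulls + 1):
        # High temperature early (near-uniform exploration), low later (exploitation).
        temperature = max(t_min, t0 / math.log(k + 1))
        probs = softmax_probs(q_values, temperature)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean for the pulled arm.
        q_values[arm] += (reward - q_values[arm]) / counts[arm]
        # Expected (pseudo-)regret: gap between the best arm and the pulled arm.
        regret += best_mean - arm_means[arm]
    return regret, counts
```

Calling `run_annealed_softmax([0.2, 0.5, 0.8], 1000)` returns the accumulated expected regret and the per-arm pull counts. Comparing the regret across different values of `n_pulls` gives an informal feel for how the cooling schedule affects its growth.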

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/99de2632d228/entropy-20-00155-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/91950d7d7117/entropy-20-00155-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/1b42a48ff66a/entropy-20-00155-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/4d8bf1442ec8/entropy-20-00155-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/16987912b650/entropy-20-00155-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f22/7512671/1441b0df1031/entropy-20-00155-g005.jpg
