Barbier-Chebbah Alex, Vestergaard Christian L, Masson Jean-Baptiste
Institut Pasteur, Université Paris Cité, CNRS UMR 3571, Decision and Bayesian Computation, 75015 Paris, France.
Épimethée, Inria, 75012 Paris, France.
Phys Rev E. 2024 May;109(5):L052105. doi: 10.1103/PhysRevE.109.L052105.
This paper addresses the exploration-exploitation dilemma inherent in decision-making, focusing on multiarmed bandit problems. These involve an agent deciding whether to exploit current knowledge for immediate gains or explore new avenues for potential long-term rewards. Here we introduce a class of algorithms, approximate information maximization (AIM), which employs a carefully chosen analytical approximation to the gradient of the entropy to choose which arm to pull at each point in time. AIM matches the performance of Thompson sampling, which is known to be asymptotically optimal, as well as that of Infomax, from which it derives. AIM thus retains the advantages of Infomax while also offering enhanced computational speed, tractability, and ease of implementation. In particular, we demonstrate how to apply it to a 50-armed bandit game. Its expression is tunable, which allows for specific optimization in various settings, making it possible to surpass the performance of Thompson sampling at short and intermediate times.
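The abstract does not give AIM's closed-form arm-selection index, but the Thompson sampling baseline it is compared against is standard and fully specified. Below is a minimal sketch of that baseline on a Bernoulli multiarmed bandit; the arm means, horizon, and 50-arm setup are illustrative values, not taken from the paper. Conceptually, AIM would replace the posterior-sampling step inside the loop with a deterministic arm choice that maximizes an analytical approximation to the entropy gradient.

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=None):
    """Thompson sampling on a Bernoulli multiarmed bandit.

    true_means: success probability of each arm (unknown to the agent).
    horizon:    total number of pulls.
    Returns the cumulative-regret trajectory.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    # Beta(1, 1) priors on each arm's mean, stored as pseudo-counts.
    successes = np.ones(k)
    failures = np.ones(k)
    best = max(true_means)
    regret = np.zeros(horizon)
    for t in range(horizon):
        # Draw one plausible mean per arm from its posterior and pull
        # the arm with the largest draw. (AIM would instead pick the arm
        # maximizing an analytical approximation of the entropy gradient.)
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret[t] = best - true_means[arm]
    return np.cumsum(regret)

# Illustrative 50-armed game, echoing the setting mentioned in the abstract.
means = np.linspace(0.1, 0.9, 50)
print(thompson_sampling(means, horizon=10_000, seed=0)[-1])
```

The Beta-Bernoulli conjugacy keeps the posterior update to two counter increments per pull, which is why Thompson sampling is the usual efficiency reference point for bandit algorithms such as AIM.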