Qi Han, Guo Fei, Zhu Li
School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Entropy (Basel). 2025 Jan 9;27(1):51. doi: 10.3390/e27010051.
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario, where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound on the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are Õ(√(T·B_T)), where T is the time horizon and B_T is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
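To make the discounted-TS idea concrete, the sketch below shows a minimal Python implementation under illustrative assumptions: 1-sub-Gaussian rewards, a flat (uninformative) Gaussian posterior per arm, and hypothetical parameter names (gamma for the discount factor, sigma for the noise scale). The abstract does not specify these details, so this is an assumption-laden illustration of exponential discounting in Thompson sampling, not the paper's exact algorithm.

```python
import numpy as np

def discounted_ts(reward_fn, n_arms, horizon, gamma=0.99, sigma=1.0, rng=None):
    """Illustrative discounted Thompson sampling with uninformative Gaussian priors.

    reward_fn(arm, t) is assumed to return a (possibly non-stationary) reward.
    gamma and sigma are hypothetical tuning parameters, not values from the paper.
    """
    rng = rng or np.random.default_rng()
    disc_count = np.zeros(n_arms)   # discounted number of pulls per arm
    disc_sum = np.zeros(n_arms)     # discounted cumulative reward per arm
    history = []
    for t in range(horizon):
        samples = np.empty(n_arms)
        for a in range(n_arms):
            if disc_count[a] < 1e-8:
                samples[a] = np.inf  # force each arm to be tried at least once
            else:
                mean = disc_sum[a] / disc_count[a]
                std = sigma / np.sqrt(disc_count[a])
                samples[a] = rng.normal(mean, std)  # posterior draw for arm a
        arm = int(np.argmax(samples))
        reward = reward_fn(arm, t)
        # Exponential discounting downweights old observations so the posterior
        # can track reward distributions that change abruptly at breakpoints.
        disc_count *= gamma
        disc_sum *= gamma
        disc_count[arm] += 1.0
        disc_sum[arm] += reward
        history.append((arm, reward))
    return history
```

A sliding-window variant would follow the same template, but instead of multiplying the statistics by gamma it would recompute them from only the most recent w observations.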