

Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm.

Authors

Cavenaghi Emanuele, Sottocornola Gabriele, Stella Fabio, Zanker Markus

Affiliations

Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, 20126 Milano, Italy.

Facoltà di Scienze e Tecnologie Informatiche, Free University of Bozen-Bolzano, 39100 Bolzano, Italy.

Publication

Entropy (Basel). 2021 Mar 23;23(3):380. doi: 10.3390/e23030380.

DOI:10.3390/e23030380
PMID:33807028
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8004723/
Abstract

The Multi-Armed Bandit (MAB) problem has been extensively studied in order to address real-world challenges related to sequential decision making. In this setting, an agent selects the best action to be performed at time-step t, based on the past rewards received from the environment. This formulation implicitly assumes that the expected payoff for each action is kept stationary by the environment through time. Nevertheless, in many real-world applications this assumption does not hold and the agent has to face a non-stationary environment, that is, one with a changing reward distribution. Thus, we present a new MAB algorithm for non-stationary environments, that is, for data streams affected by concept drift. The algorithm is based on Thompson Sampling (TS) and exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. We investigate how to combine these two sources of information, namely the discount factor and the sliding window, by means of an aggregation function f(.). In particular, we propose a pessimistic (f=min), an optimistic (f=max), and an averaged (f=mean) version of the algorithm. A rich set of numerical experiments is performed to evaluate the algorithm against both stationary and non-stationary state-of-the-art TS baselines. We exploit synthetic environments (both randomly generated and controlled) to test the MAB algorithms under different types of drift, that is, sudden/abrupt, incremental, gradual, and increasing/decreasing drift. Furthermore, we adapt four real-world active learning tasks to our framework: a prediction task on crimes in the city of Baltimore, a classification task on insect species, a recommendation task on local web news, and a time-series analysis on microbial organisms in the tropical air ecosystem. The proposed approach emerges as the best-performing MAB algorithm: at least one of its versions performs better than the baselines in synthetic environments, proving its robustness under different concept drift types. Moreover, the pessimistic version (f=min) proves the most effective in all real-world tasks.
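The mechanism the abstract describes — a discounted reward history plus a per-arm sliding window, combined through an aggregation function f(.) — can be sketched as follows. This is a minimal Beta-Bernoulli illustration of that idea, not the authors' implementation; the class and parameter names (`gamma`, `window`) are assumptions introduced here.

```python
import random
from collections import deque

class DriftAwareTSArm:
    """One arm of a drift-aware Thompson Sampling bandit: a discounted
    Beta posterior plus a sliding-window Beta posterior (sketch only)."""

    def __init__(self, gamma=0.99, window=50):
        self.gamma = gamma                  # discount factor on the reward history
        self.succ = 0.0                     # discounted success count
        self.fail = 0.0                     # discounted failure count
        self.recent = deque(maxlen=window)  # arm-related sliding window of rewards

    def update(self, reward):
        # Discount the whole history, then add the new 0/1 observation.
        self.succ = self.gamma * self.succ + reward
        self.fail = self.gamma * self.fail + (1 - reward)
        self.recent.append(reward)

    def sample(self, f=min):
        # One draw from each posterior (Beta(1,1) prior), combined by f(.):
        # f=min gives the pessimistic variant, f=max the optimistic one,
        # and an averaging function the averaged one.
        s_disc = random.betavariate(self.succ + 1, self.fail + 1)
        wins = sum(self.recent)
        s_win = random.betavariate(wins + 1, len(self.recent) - wins + 1)
        return f(s_disc, s_win)

def choose(arms, f=min):
    # Standard TS selection: play the arm with the largest aggregated draw.
    return max(range(len(arms)), key=lambda i: arms[i].sample(f))
```

Under sudden drift the window posterior forgets quickly while the discounted one degrades smoothly; taking f=min keeps the agent cautious about arms whose recent evidence disagrees with their long-run history, which is consistent with the pessimistic version winning on the real-world tasks.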


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/f3870a7fb604/entropy-23-00380-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/fd3ad6449d5a/entropy-23-00380-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/14cc4fa56cb8/entropy-23-00380-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/bcf46e14a949/entropy-23-00380-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/dcca583aec25/entropy-23-00380-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/b0008cff9df6/entropy-23-00380-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/750c4d576948/entropy-23-00380-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/6c6768b02a1b/entropy-23-00380-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/c90fd2b69861/entropy-23-00380-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/dee0be112c7b/entropy-23-00380-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/9460dac73b60/entropy-23-00380-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/ffd1512629f9/entropy-23-00380-g0A1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/59b545e5670f/entropy-23-00380-g0A2a.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/20f7a6b75348/entropy-23-00380-g0A3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c1c/8004723/d02f4226ff8c/entropy-23-00380-g0A4.jpg

Similar articles

1. Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm.
   Entropy (Basel). 2021 Mar 23;23(3):380. doi: 10.3390/e23030380.
2. An empirical evaluation of active inference in multi-armed bandits.
   Neural Netw. 2021 Dec;144:229-246. doi: 10.1016/j.neunet.2021.08.018. Epub 2021 Aug 26.
3. Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
   Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.
4. Gateway Selection in Millimeter Wave UAV Wireless Networks Using Multi-Player Multi-Armed Bandit.
   Sensors (Basel). 2020 Jul 16;20(14):3947. doi: 10.3390/s20143947.
5. A Thompson Sampling Algorithm With Logarithmic Regret for Unimodal Gaussian Bandit.
   IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5332-5341. doi: 10.1109/TNNLS.2023.3295360. Epub 2023 Sep 1.
6. Optimism in the face of uncertainty supported by a statistically-designed multi-armed bandit algorithm.
   Biosystems. 2017 Oct;160:25-32. doi: 10.1016/j.biosystems.2017.08.004. Epub 2017 Aug 22.
7. Designing a Streaming Algorithm for Outlier Detection in Data Mining-An Incremental Approach.
   Sensors (Basel). 2020 Feb 26;20(5):1261. doi: 10.3390/s20051261.
8. Bandit Change-Point Detection for Real-Time Monitoring High-Dimensional Data Under Sampling Control.
   Technometrics. 2023;65(1):33-43. doi: 10.1080/00401706.2022.2054861. Epub 2022 Apr 22.
9. Risk-aware multi-armed bandit problem with application to portfolio selection.
   R Soc Open Sci. 2017 Nov 15;4(11):171377. doi: 10.1098/rsos.171377. eCollection 2017 Nov.
10. An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.
   IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.

Cited by

1. Thompson Sampling for Non-Stationary Bandit Problems.
   Entropy (Basel). 2025 Jan 9;27(1):51. doi: 10.3390/e27010051.
2. Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models.
   Entropy (Basel). 2023 Jan 18;25(2):188. doi: 10.3390/e25020188.
3. Similar network compositions, but distinct neural dynamics underlying belief updating in environments with and without explicit outcomes.

References

1. Microbial communities in the tropical air ecosystem follow a precise diel cycle.
   Proc Natl Acad Sci U S A. 2019 Nov 12;116(46):23299-23308. doi: 10.1073/pnas.1908493116. Epub 2019 Oct 28.
2. Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges.
   Stat Sci. 2015;30(2):199-215. doi: 10.1214/14-STS504.
3. Incremental learning of concept drift in nonstationary environments.
   IEEE Trans Neural Netw. 2011 Oct;22(10):1517-31. doi: 10.1109/TNN.2011.2160459. Epub 2011 Aug 4.