IEEE Trans Neural Netw Learn Syst. 2019 Aug;30(8):2493-2502. doi: 10.1109/TNNLS.2018.2885123. Epub 2018 Dec 28.
Upper confidence bound (UCB) is a successful multiarmed bandit algorithm for regret minimization. The covariance matrix adaptation (CMA) for Pareto UCB (CMA-PUCB) algorithm considers stochastic reward vectors with correlated objectives. Using a variant of the Bernstein inequality for matrices, we show that the cumulative pseudoregret of pulling suboptimal arms under CMA-PUCB is upper bounded by O(ln(nDK) ∑ (||Σ||/∆)), which is logarithmic in the number of arms K, objectives D, and samples n, where ∆ is the regret of pulling the suboptimal arm i. For unknown covariance matrices Σ between objectives, we upper bound the approximation of the covariance matrix, in terms of the number of samples, by O(n ln(nDK) + ln(nDK) ∑ (1/∆)). Simulations on a three-objective stochastic environment show the applicability of our method.
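To make the arm-selection idea concrete, the following is a minimal sketch of a generic Pareto UCB rule on a three-objective environment. It is not the CMA-PUCB algorithm itself: the exploration bonus `sqrt(2 ln(nDK) / count)` is an assumed scalar bonus that ignores the covariance matrix Σ, the rewards are independent Bernoulli draws chosen only for illustration, and all function names (`dominates`, `pareto_front`, `pareto_ucb_select`, `run`) are hypothetical.

```python
import math
import random


def dominates(u, v):
    """True if u Pareto-dominates v: no worse in every objective, better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    """Indices of the vectors that no other vector dominates."""
    return [
        i for i, u in enumerate(vectors)
        if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)
    ]


def pareto_ucb_select(means, counts, n, rng=random):
    """Pick an arm uniformly from the Pareto front of the UCB index vectors."""
    K, D = len(means), len(means[0])
    # Initialization: pull every arm once before using the index.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    # Assumed bonus (not the covariance-aware bound of CMA-PUCB).
    index = [
        [m + math.sqrt(2.0 * math.log(n * D * K) / counts[i]) for m in means[i]]
        for i in range(K)
    ]
    return rng.choice(pareto_front(index))


def run(true_means, horizon, seed=0):
    """Simulate the rule on independent Bernoulli objectives; return pull counts."""
    rng = random.Random(seed)
    K, D = len(true_means), len(true_means[0])
    counts = [0] * K
    means = [[0.0] * D for _ in range(K)]
    for n in range(1, horizon + 1):
        i = pareto_ucb_select(means, counts, n, rng)
        reward = [1.0 if rng.random() < p else 0.0 for p in true_means[i]]
        counts[i] += 1
        # Incremental update of the empirical mean reward vector of arm i.
        means[i] = [m + (r - m) / counts[i] for m, r in zip(means[i], reward)]
    return counts


if __name__ == "__main__":
    # Hypothetical environment: arms 0 and 1 are Pareto optimal, arm 2 is dominated.
    env = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.05, 0.05, 0.05]]
    print(run(env, horizon=2000))
```

Calling `run(env, horizon)` returns the number of pulls per arm. A covariance-aware variant would replace the scalar bonus with one that depends on an estimate of Σ for each arm; that step is left out here because the abstract does not specify its form.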