Zhang Kelly W, Janson Lucas, Murphy Susan A
Department of Computer Science, Harvard University.
Department of Statistics, Harvard University.
Adv Neural Inf Process Syst. 2020 Dec;33:9818-9829.
As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.
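The batched idea described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation. It assumes a two-armed bandit with equal arm means (so there is no unique optimal arm), Gaussian rewards with known noise level `sigma`, and an epsilon-greedy policy that is updated only between batches. The function names, the per-batch weighting `sqrt(n0 * n1 / n)`, and the convention of skipping a batch in which one arm was never played are assumptions made for illustration; none of them is stated in the abstract.

```python
import numpy as np


def simulate_batched_bandit(rng, T=5, n=100, means=(0.0, 0.0), sigma=1.0, eps=0.2):
    """Simulate a two-armed epsilon-greedy bandit updated only between batches.

    Hypothetical helper: returns one array of actions and one array of
    rewards per batch.
    """
    means = np.asarray(means, dtype=float)
    actions, rewards = [], []
    p1 = 0.5  # first batch: both arms equally likely
    for _ in range(T):
        a = (rng.random(n) < p1).astype(int)
        r = means[a] + sigma * rng.standard_normal(n)
        actions.append(a)
        rewards.append(r)
        # Pick the greedy arm from all data collected so far.
        all_a = np.concatenate(actions)
        all_r = np.concatenate(rewards)
        m0 = all_r[all_a == 0].mean() if (all_a == 0).any() else 0.0
        m1 = all_r[all_a == 1].mean() if (all_a == 1).any() else 0.0
        p1 = 1 - eps / 2 if m1 > m0 else eps / 2
    return actions, rewards


def bols_statistic(actions, rewards, sigma, delta0=0.0):
    """Batch-wise standardized estimate of the margin (arm 1 minus arm 0).

    Each batch's difference in arm means is scaled by sqrt(n0 * n1 / n) / sigma,
    which gives it unit variance conditional on that batch's arm counts; the
    per-batch terms are then summed and divided by sqrt(T).
    """
    T = len(actions)
    total = 0.0
    for a, r in zip(actions, rewards):
        n1 = int(a.sum())
        n0 = a.size - n1
        if n0 == 0 or n1 == 0:
            continue  # assumed convention: such a batch has weight zero
        delta_hat = r[a == 1].mean() - r[a == 0].mean()
        total += np.sqrt(n0 * n1 / a.size) * (delta_hat - delta0) / sigma
    return total / np.sqrt(T)


# Usage: repeat the experiment and inspect the statistic under the null.
rng = np.random.default_rng(0)
stats = np.array(
    [bols_statistic(*simulate_batched_bandit(rng), sigma=1.0) for _ in range(2000)]
)
print(stats.mean(), stats.std())
```

Standardizing within each batch means the random arm counts, which depend on earlier data, are conditioned on rather than pooled. Under the assumptions above, the printed mean and standard deviation should be close to 0 and 1 if the statistic behaves like a standard normal.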