Chen Haoyu, Lu Wenbin, Song Rui
Department of Statistics, North Carolina State University.
J Am Stat Assoc. 2021;116(533):240-255. doi: 10.1080/01621459.2020.1770098. Epub 2020 Jul 7.
The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of the model parameters is asymptotically normal. When the linear model is misspecified, we propose an online weighted least squares estimator using inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!.
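To make the setup concrete, the following is a minimal simulation sketch (not the authors' implementation) of an ε-greedy contextual bandit with a linear reward model, an online least squares update for the chosen arm, and an in-sample inverse-propensity-weighted estimate of the greedy policy's value. All dimensions, the exploration rate, and the data-generating coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative setup (assumed, not from the paper) ---
d, K, T, eps = 5, 3, 5000, 0.1           # context dim, number of arms, horizon, exploration rate
beta_true = rng.normal(size=(K, d))       # true linear reward coefficients per arm

# Online sufficient statistics for per-arm least squares
XtX = np.stack([np.eye(d) * 1e-6 for _ in range(K)])   # tiny ridge term keeps the solve well-defined
Xty = np.zeros((K, d))
ipw_value_sum = 0.0

for t in range(T):
    x = rng.normal(size=d)                                # contextual covariates
    beta_hat = np.array([np.linalg.solve(XtX[a], Xty[a]) for a in range(K)])
    greedy = int(np.argmax(beta_hat @ x))                 # arm favored by the current estimates

    # epsilon-greedy: explore uniformly with probability eps, otherwise exploit
    a = int(rng.integers(K)) if rng.random() < eps else greedy
    # propensity score of the chosen arm under the eps-greedy policy
    prop = eps / K + (1 - eps) * (a == greedy)

    r = beta_true[a] @ x + rng.normal()                   # observed reward

    # online ordinary least squares update for the chosen arm
    # (the paper's weighted variant would instead scale these updates by 1 / prop)
    XtX[a] += np.outer(x, x)
    Xty[a] += r * x

    # in-sample inverse-propensity-weighted value of the greedy policy
    ipw_value_sum += (a == greedy) * r / prop

print("IPW value estimate of the greedy policy:", ipw_value_sum / T)
```

The asymptotic normality results in the paper concern the sampling distributions of the least squares coefficients and of the IPW value estimate produced by loops of this kind; the sketch only illustrates how those quantities are computed online.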