Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting.

Authors

Chen Haoyu, Lu Wenbin, Song Rui

Affiliation

Department of Statistics, North Carolina State University.

Publication

J Am Stat Assoc. 2021;116(533):240-255. doi: 10.1080/01621459.2020.1770098. Epub 2020 Jul 7.

Abstract

The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of the model parameters is asymptotically normal. When the linear model is misspecified, we propose the online weighted least squares estimator using inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!.
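The ingredients named in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes two arms, a context of the form (1, z) with z uniform on [-1, 1], a fixed exploration rate ε, Gaussian noise, and a well-specified linear reward, all chosen only for illustration. The names `run_bandit` and `solve_2x2` are hypothetical. It fits a per-arm online ordinary least squares estimate and accumulates one common form of an in-sample inverse propensity weighted value estimate for the greedy rule; the exact estimators analysed in the paper may differ in detail.

```python
import random

def solve_2x2(s, b):
    """Solve the 2x2 linear system s @ beta = b; return zeros if s is singular."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        return [0.0, 0.0]
    return [(s[1][1] * b[0] - s[0][1] * b[1]) / det,
            (s[0][0] * b[1] - s[1][0] * b[0]) / det]

def run_bandit(true_beta, horizon=5000, epsilon=0.1, noise_sd=0.5, seed=0):
    """Simulate an epsilon-greedy linear contextual bandit (illustrative setup)."""
    rng = random.Random(seed)
    n_arms = len(true_beta)
    # Per-arm sufficient statistics for OLS: sum of x x^T and sum of x * reward.
    sxx = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(n_arms)]
    sxy = [[0.0, 0.0] for _ in range(n_arms)]
    beta_hat = [[0.0, 0.0] for _ in range(n_arms)]
    ipw_sum = 0.0
    for _ in range(horizon):
        x = [1.0, rng.uniform(-1.0, 1.0)]  # context: intercept plus one covariate
        preds = [b[0] * x[0] + b[1] * x[1] for b in beta_hat]
        greedy = max(range(n_arms), key=lambda a: preds[a])
        # Epsilon-greedy: explore uniformly with probability epsilon, else exploit.
        arm = rng.randrange(n_arms) if rng.random() < epsilon else greedy
        # Probability that the policy picks the chosen arm given the context.
        propensity = (1 - epsilon) * (arm == greedy) + epsilon / n_arms
        reward = (true_beta[arm][0] * x[0] + true_beta[arm][1] * x[1]
                  + rng.gauss(0.0, noise_sd))
        # IPW value term: keep rewards where the action matched the greedy rule.
        if arm == greedy:
            ipw_sum += reward / propensity
        # Online OLS update for the chosen arm only.
        for i in range(2):
            sxy[arm][i] += x[i] * reward
            for j in range(2):
                sxx[arm][i][j] += x[i] * x[j]
        beta_hat[arm] = solve_2x2(sxx[arm], sxy[arm])
    return beta_hat, ipw_sum / horizon

if __name__ == "__main__":
    # Assumed truth: arm 0 pays 0.5 + z, arm 1 pays 0.5 - z.
    beta_hat, value = run_bandit([[0.5, 1.0], [0.5, -1.0]])
    print("estimated coefficients:", beta_hat)
    print("IPW value estimate:", value)
```

In this assumed setup the optimal rule picks arm 0 when z > 0 and arm 1 otherwise, so its expected reward is 0.5 + E|z| = 1. Repeating the simulation over many seeds and plotting the estimates would be one informal way to look at the approximate normality the abstract describes.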

Similar articles

1
Inference for Batched Bandits.
Adv Neural Inf Process Syst. 2020 Dec;33:9818-9829.
2
Post-Contextual-Bandit Inference.
Adv Neural Inf Process Syst. 2021 Dec;34:28548-28559.
3
A Multiplier Bootstrap Approach to Designing Robust Algorithms for Contextual Bandits.
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9887-9899. doi: 10.1109/TNNLS.2022.3161806. Epub 2023 Nov 30.
