


Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting.

Authors

Chen Haoyu, Lu Wenbin, Song Rui

Affiliation

Department of Statistics, North Carolina State University.

Publication

J Am Stat Assoc. 2021;116(533):240-255. doi: 10.1080/01621459.2020.1770098. Epub 2020 Jul 7.

DOI: 10.1080/01621459.2020.1770098
PMID: 33737759
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7962379/
Abstract

The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions typically learn a reward model for the different actions given the contextual information and then maximize the long-term reward. It is therefore important to know whether the posited model is reasonable and how it performs in the asymptotic sense. We study this problem under the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of the model parameters is asymptotically normal. When the linear model is misspecified, we propose an online weighted least squares estimator using inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!.

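The pipeline the abstract describes — an ε-greedy policy over a linear reward model fit by online least squares, with an inverse-propensity-weighted (IPW) estimate of the greedy policy's value — can be sketched in simulation. Everything below (dimensions, true coefficients, exploration rate, the environment itself) is an illustrative assumption, not code or data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative environment: 2 actions, reward(a, x) = x @ beta_true[a] + noise.
d, n_actions, T = 3, 2, 5000
beta_true = np.array([[1.0, -0.5, 0.2],
                      [0.3, 0.8, -0.4]])
eps = 0.1  # exploration rate of the epsilon-greedy policy

# Per-action sufficient statistics for online OLS: X'X and X'y.
XtX = [np.eye(d) * 1e-6 for _ in range(n_actions)]  # tiny ridge for invertibility
Xty = [np.zeros(d) for _ in range(n_actions)]

value_num = 0.0  # accumulator for the IPW value estimate

for t in range(T):
    x = rng.normal(size=d)
    beta_hat = [np.linalg.solve(XtX[a], Xty[a]) for a in range(n_actions)]
    greedy = int(np.argmax([x @ b for b in beta_hat]))

    # epsilon-greedy: explore uniformly with prob. eps, else exploit.
    a = int(rng.integers(n_actions)) if rng.random() < eps else greedy
    # Propensity of the chosen action under the policy (known by design).
    prop = eps / n_actions + (1 - eps) * (a == greedy)

    r = x @ beta_true[a] + rng.normal()
    XtX[a] += np.outer(x, x)           # online OLS update
    Xty[a] += x * r
    value_num += (a == greedy) * r / prop  # IPW term for the greedy policy's value

beta_final = [np.linalg.solve(XtX[a], Xty[a]) for a in range(n_actions)]
value_ipw = value_num / T
```

Because the propensities are known exactly (the experimenter sets the ε-greedy randomization), the IPW value estimate needs no estimated weights; the paper's asymptotic-normality results then justify Wald-type confidence intervals for both the coefficients and the value, which this sketch does not compute.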

Similar Articles

1
Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting.
J Am Stat Assoc. 2021;116(533):240-255. doi: 10.1080/01621459.2020.1770098. Epub 2020 Jul 7.
2
Improving causal inference with a doubly robust estimator that combines propensity score stratification and weighting.
J Eval Clin Pract. 2017 Aug;23(4):697-702. doi: 10.1111/jep.12714. Epub 2017 Jan 24.
3
Inference for Batched Bandits.
Adv Neural Inf Process Syst. 2020 Dec;33:9818-9829.
4
Altered Statistical Learning and Decision-Making in Methamphetamine Dependence: Evidence from a Two-Armed Bandit Task.
Front Psychol. 2015 Dec 18;6:1910. doi: 10.3389/fpsyg.2015.01910. eCollection 2015.
5
Post-Contextual-Bandit Inference.
Adv Neural Inf Process Syst. 2021 Dec;34:28548-28559.
6
A Multiplier Bootstrap Approach to Designing Robust Algorithms for Contextual Bandits.
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9887-9899. doi: 10.1109/TNNLS.2022.3161806. Epub 2023 Nov 30.
7
Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework.
Scand Stat Theory Appl. 2020 Sep;47(3):839-861. doi: 10.1111/sjos.12429. Epub 2019 Nov 8.
8
Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation.
Stat Med. 2012 Jul 10;31(15):1572-81. doi: 10.1002/sim.4496. Epub 2012 Feb 23.
9
Targeted estimation of nuisance parameters to obtain valid statistical inference.
Int J Biostat. 2014;10(1):29-57. doi: 10.1515/ijb-2012-0038.
10
Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.

Cited By

1
Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials.
J Mach Learn Res. 2022;23(250).
2
A single-index model with a surface-link for optimizing individualized dose rules.
J Comput Graph Stat. 2022;31(2):553-562. doi: 10.1080/10618600.2021.1923521. Epub 2021 Jun 21.
3
Statistical Inference with M-Estimators on Adaptively Collected Data.
Adv Neural Inf Process Syst. 2021 Dec;34:7460-7471.

References

1
Targeted sequential design for targeted learning inference of the optimal treatment rule and its mean reward.
Ann Stat. 2017;45(6):2537-2564. doi: 10.1214/16-AOS1534. Epub 2017 Dec 15.
2
Concordance-Assisted Learning for Estimating Optimal Individualized Treatment Regimes.
J R Stat Soc Series B Stat Methodol. 2017 Nov;79(5):1565-1582. doi: 10.1111/rssb.12216. Epub 2016 Oct 31.
3
Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions.
Biometrika. 2013;100(3). doi: 10.1093/biomet/ast014.
4
Estimating Individualized Treatment Rules Using Outcome Weighted Learning.
J Am Stat Assoc. 2012 Sep 1;107(449):1106-1118. doi: 10.1080/01621459.2012.695674.
5
A robust method for estimating optimal treatment regimes.
Biometrics. 2012 Dec;68(4):1010-8. doi: 10.1111/j.1541-0420.2012.01763.x. Epub 2012 May 2.
6
Reinforcement learning design for cancer clinical trials.
Stat Med. 2009 Nov 20;28(26):3294-315. doi: 10.1002/sim.3720.