Inference for Batched Bandits

Authors

Zhang Kelly W, Janson Lucas, Murphy Susan A

Affiliations

Department of Computer Science, Harvard University.

Department of Statistics, Harvard University.

Publication

Adv Neural Inf Process Syst. 2020 Dec;33:9818-9829.

PMID: 35002190
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8734616/
Abstract

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.
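The batched OLS idea described in the abstract can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses two arms, a known noise scale sigma, and an ε-greedy allocation that is held fixed within each batch (the function name `batched_bandit_bols` and the allocation rule are choices made here for illustration). Each batch contributes a standardized per-batch OLS estimate of the arm difference; under the null of no treatment effect these per-batch terms are (conditionally) standard normal, so their normalized sum is approximately N(0, 1) even though the allocation adapts between batches.

```python
import numpy as np

rng = np.random.default_rng(0)

def batched_bandit_bols(n_batches=25, batch_size=100, eps=0.1,
                        mu=(0.0, 0.0), sigma=1.0):
    """Simulate a two-arm eps-greedy bandit run in batches and return a
    batched-OLS-style test statistic for the arm difference mu[1] - mu[0]."""
    p_arm1 = 0.5  # uniform allocation in the first batch
    z_terms = []
    for _ in range(n_batches):
        # sample actions with the allocation probability fixed within the batch
        a = rng.random(batch_size) < p_arm1
        y = rng.normal(np.where(a, mu[1], mu[0]), sigma)
        n1, n0 = a.sum(), (~a).sum()
        if n1 == 0 or n0 == 0:
            continue  # degenerate batch: no within-batch contrast available
        # per-batch OLS estimate of the arm difference, standardized
        diff = y[a].mean() - y[~a].mean()
        z_terms.append(np.sqrt(n1 * n0 / (n1 + n0)) * diff / sigma)
        # eps-greedy update for the next batch, based on this batch's means
        best_is_arm1 = y[a].mean() > y[~a].mean()
        p_arm1 = 1 - eps / 2 if best_is_arm1 else eps / 2
    # combine the per-batch statistics; approximately N(0, 1) under the null
    return np.sum(z_terms) / np.sqrt(len(z_terms))

z = batched_bandit_bols()
```

Repeating the simulation many times and inspecting the empirical distribution of `z` shows it staying close to standard normal, which is the contrast with the plain pooled OLS statistic whose null distribution can be non-normal on adaptively collected data.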


Figures (PMC8734616):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/caf8f8cd9ab3/nihms-1641560-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/6703093b61c4/nihms-1641560-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/eb1739cb54ff/nihms-1641560-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/833e8c2bd0cd/nihms-1641560-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/73d0440db48c/nihms-1641560-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/613461972a0c/nihms-1641560-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/1d81756041e3/nihms-1641560-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc06/8734616/16f24a691ffe/nihms-1641560-f0008.jpg

Similar Articles

1. Inference for Batched Bandits.
   Adv Neural Inf Process Syst. 2020 Dec;33:9818-9829.
2. A Multiplier Bootstrap Approach to Designing Robust Algorithms for Contextual Bandits.
   IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9887-9899. doi: 10.1109/TNNLS.2022.3161806. Epub 2023 Nov 30.
3. Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting.
   J Am Stat Assoc. 2021;116(533):240-255. doi: 10.1080/01621459.2020.1770098. Epub 2020 Jul 7.
4. Statistical Inference with M-Estimators on Adaptively Collected Data.
   Adv Neural Inf Process Syst. 2021 Dec;34:7460-7471.
5. Post-Contextual-Bandit Inference.
   Adv Neural Inf Process Syst. 2021 Dec;34:28548-28559.
6. An empirical evaluation of active inference in multi-armed bandits.
   Neural Netw. 2021 Dec;144:229-246. doi: 10.1016/j.neunet.2021.08.018. Epub 2021 Aug 26.
7. A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials.
   Life (Basel). 2022 Aug 21;12(8):1277. doi: 10.3390/life12081277.
8. Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback.
   Neural Comput. 2020 Sep;32(9):1733-1773. doi: 10.1162/neco_a_01299. Epub 2020 Jul 20.
9. An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.
   IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2285-2291. doi: 10.1109/TNNLS.2020.2995920. Epub 2021 May 3.
10. Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
    Biosystems. 2015 Sep;135:55-65. doi: 10.1016/j.biosystems.2015.06.009. Epub 2015 Jul 10.

Cited By

1. Adaptive randomization methods for sequential multiple assignment randomized trials (SMARTs) via Thompson sampling.
   Biometrics. 2024 Oct 3;80(4). doi: 10.1093/biomtc/ujae152.
2. Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling.
   Mach Learn. 2024 Jul;113(7):3961-3997. doi: 10.1007/s10994-024-06526-x. Epub 2024 Apr 10.
3. Reward Design For An Online Reinforcement Learning Algorithm Supporting Oral Self-Care.
   Proc Innov Appl Artif Intell Conf. 2023 Jun 27;37(13):15724-15730. doi: 10.1609/aaai.v37i13.26866.

References

1. Power Constrained Bandits.
   Proc Mach Learn Res. 2021 Aug;149:209-259.
2. Personalized HeartSteps: A Reinforcement Learning Algorithm for Optimizing Physical Activity.
   Proc ACM Interact Mob Wearable Ubiquitous Technol. 2020 Mar;4(1). doi: 10.1145/3381007.
3. Confidence intervals for policy evaluation in adaptive experiments.
   Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2014602118.
4. Some performance considerations when using multi-armed bandit algorithms in the presence of missing data.
   PLoS One. 2022 Sep 12;17(9):e0274272. doi: 10.1371/journal.pone.0274272. eCollection 2022.
5. Statistical Inference with M-Estimators on Adaptively Collected Data.
   Adv Neural Inf Process Syst. 2021 Dec;34:7460-7471.
6. Scaling up behavioral science interventions in online education.
   Proc Natl Acad Sci U S A. 2020 Jun 30;117(26):14900-14905. doi: 10.1073/pnas.1921417117. Epub 2020 Jun 15.
7. Maximizing Engagement in Mobile Health Studies: Lessons Learned and Future Directions.
   Rheum Dis Clin North Am. 2019 May;45(2):159-172. doi: 10.1016/j.rdc.2019.01.004. Epub 2019 Mar 8.
8. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy.
   Ann Stat. 2016 Apr;44(2):713-742. doi: 10.1214/15-AOS1384. Epub 2016 Mar 17.
9. Efficacy of Contextually Tailored Suggestions for Physical Activity: A Micro-randomized Optimization Trial of HeartSteps.
   Ann Behav Med. 2019 May 3;53(6):573-582. doi: 10.1093/abm/kay067.
10. Parametric-rate inference for one-sided differentiable parameters.
    J Am Stat Assoc. 2018;113(522):780-788. doi: 10.1080/01621459.2017.1285777. Epub 2017 Feb 28.
11. Encouraging Physical Activity in Patients With Diabetes: Intervention Using a Reinforcement Learning System.
    J Med Internet Res. 2017 Oct 10;19(10):e338. doi: 10.2196/jmir.7994.
12. Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges.
    Stat Sci. 2015;30(2):199-215. doi: 10.1214/14-STS504.