Huch Easton K, Shi Jieru, Abbott Madeline R, Golbus Jessica R, Moreno Alexander, Dempsey Walter H
Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA.
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
Adv Neural Inf Process Syst. 2024;37:128280-128329.
Mobile health leverages personalized, contextually tailored interventions optimized through bandit and reinforcement learning algorithms. Despite its promise, challenges such as participant heterogeneity, nonstationarity, and nonlinearity in rewards hinder algorithm performance. We propose a robust contextual bandit algorithm, termed "DML-TS-NNR", that simultaneously addresses these challenges via (1) a differential reward model with user- and time-specific incidental parameters, (2) network cohesion penalties, and (3) debiased machine learning for flexible estimation of baseline rewards. We establish a high-probability regret bound that depends solely on the dimension of the differential reward model. This feature yields robust regret bounds even when the baseline reward is highly complex. We demonstrate the superior performance of the DML-TS-NNR algorithm in a simulation study and two off-policy evaluation studies.
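To make the three components concrete, the following is a minimal illustrative sketch, not the paper's algorithm: a generic linear Thompson Sampling update in which a plug-in machine-learning estimate of the baseline reward is subtracted from the observed reward (standing in for the debiased-machine-learning step) and a graph-Laplacian prior pools per-user differential-reward parameters (standing in for the network cohesion penalty). All names, shapes, and the `baseline_estimate` stand-in are assumptions made for illustration only.

```python
import numpy as np

# Illustrative sketch under assumed names/shapes; not the authors' DML-TS-NNR implementation.
rng = np.random.default_rng(0)

n_users, d = 5, 3            # users and differential-reward dimension
lam, gamma = 1.0, 0.5        # ridge and network-cohesion penalty strengths

# Fully connected user graph -> Laplacian L = D - A (network cohesion prior)
A = np.ones((n_users, n_users)) - np.eye(n_users)
L_graph = np.diag(A.sum(axis=1)) - A

# Prior precision over the stacked per-user parameters: lam*I + gamma*(L kron I_d)
V = lam * np.eye(n_users * d) + gamma * np.kron(L_graph, np.eye(d))
b = np.zeros(n_users * d)

def baseline_estimate(context):
    """Stand-in for a flexible ML estimate of the baseline reward (assumed)."""
    return 0.1 * context.sum()

def select_action(user, context):
    """Thompson Sampling: sample parameters from the Gaussian posterior."""
    mean = np.linalg.solve(V, b)
    theta = rng.multivariate_normal(mean, np.linalg.inv(V))
    theta_u = theta[user * d:(user + 1) * d]
    # Binary action: act if the sampled differential reward is positive.
    return int(context @ theta_u > 0)

def update(user, context, action, reward):
    """Rank-one posterior update using the baseline-adjusted pseudo-outcome."""
    global b
    x = np.zeros(n_users * d)
    x[user * d:(user + 1) * d] = action * context
    V[:] += np.outer(x, x)
    b += x * (reward - baseline_estimate(context))

# Tiny usage example on synthetic data
for t in range(20):
    u = t % n_users
    ctx = rng.normal(size=d)
    a = select_action(u, ctx)
    r = baseline_estimate(ctx) + a * 0.3 * ctx.sum() + rng.normal(scale=0.1)
    update(u, ctx, a, r)
```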