Aurélien Bibaut, Nathan Kallus, Maria Dimakopoulou, Antoine Chambaz, Mark van der Laan
Netflix.
Cornell University and Netflix.
Adv Neural Inf Process Syst. 2021 Dec;34:19261-19273.
Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data, such as the result of running a contextual bandit algorithm. We study a generic importance-sampling-weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class, and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that carefully leverages the importance sampling structure to obtain rates with the right dependence on the exploration rate in the data. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
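To illustrate the idea the abstract describes, here is a minimal sketch of importance-sampling-weighted ERM on logged bandit data. The hypothesis class, data-generating process, and function names are hypothetical choices for illustration (a toy class of constant policies over two actions with a uniform logging policy), not the paper's actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit log: 2 actions, uniform-exploration logging
# policy, and action 1 has the higher mean reward (1.0 vs 0.2).
n = 2000
actions = rng.integers(0, 2, size=n)
rewards = rng.normal(loc=np.where(actions == 1, 1.0, 0.2))
prop_taken = np.full(n, 0.5)  # probability the logger gave the taken action

def iw_loss(a):
    # Importance-weighted empirical loss (negative reward) of the
    # constant policy "always play action a": reweight each logged
    # sample by 1{policy agrees with logged action} / propensity.
    w = (actions == a) / prop_taken
    return np.mean(w * (-rewards))

# Weighted ERM: pick the hypothesis minimizing the importance-weighted
# empirical loss over the (tiny) hypothesis class {always 0, always 1}.
best = min(range(2), key=iw_loss)
```

With adaptively collected data the propensities would vary per round (they come from the bandit algorithm's evolving policy), and the paper's guarantees concern how the achievable rate degrades as those exploration probabilities decay toward zero.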