Alex Luedtke, Antoine Chambaz
Department of Statistics, University of Washington, USA.
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, USA.
Ann Inst Henri Poincaré Probab Stat. 2020 Aug;56(3):2162-2188. doi: 10.1214/19-AIHP1034. Epub 2020 Jun 26.
This article gives performance guarantees for the regret decay in optimal policy estimation. We give a margin-free result showing that, when the data are generated from a fixed distribution that does not change with sample size, the regret of empirical risk minimizers over Donsker classes for estimating a within-class optimal policy decays at a second-order rate, faster than the standard error of an efficient estimator of the value of an optimal policy. We also present a result guaranteeing the regret decay of policy estimators when the policy falls within a restricted class and the data are generated from local perturbations of a fixed distribution; this guarantee is uniform in the direction of the local perturbation. Finally, we give a result from the classification literature showing that faster regret decay is possible via plug-in estimation provided a margin condition holds. Three examples are considered. In these examples, the regret is expressed in terms of either the mean value or the median value, and the number of possible actions is either two or finitely many.
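The notions of empirical risk minimization over a policy class and of regret can be illustrated with a toy simulation. The sketch below is not from the paper: the data-generating distribution, the class of threshold policies, and the inverse-probability-weighted value estimator are all illustrative assumptions, chosen so that the optimal policy and its value are known in closed form and the regret of the estimated policy can be computed exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: covariate X ~ Uniform(0, 1), binary action A assigned at
# random with propensity 1/2, outcome Y with E[Y | X, A] = A * (X - 0.5).
# The optimal policy therefore treats (A = 1) exactly when X > 0.5.
n = 2000
X = rng.uniform(0.0, 1.0, n)
A = rng.integers(0, 2, n)
Y = A * (X - 0.5) + rng.normal(0.0, 0.1, n)

def value_ipw(threshold, X, A, Y):
    """Inverse-probability-weighted estimate of the value of the
    threshold policy d(x) = 1{x > threshold} (known propensity 1/2)."""
    d = (X > threshold).astype(int)
    return np.mean((A == d) * Y / 0.5)

# Empirical risk minimization over a grid of threshold policies:
# pick the threshold whose estimated value is largest.
grid = np.linspace(0.0, 1.0, 101)
est = grid[np.argmax([value_ipw(t, X, A, Y) for t in grid])]

def true_value(t):
    # E[(X - 0.5) 1{X > t}] for X ~ Uniform(0, 1) is t/2 - t**2/2,
    # maximized at the optimal threshold t = 0.5 with value 0.125.
    return 0.5 * t - 0.5 * t**2

# Regret: value of the optimal policy minus value of the estimated one.
regret = true_value(0.5) - true_value(est)
print(f"estimated threshold: {est:.2f}, regret: {regret:.5f}")
```

The regret is nonnegative by construction; the paper's results concern how quickly quantities like this one shrink as the sample size grows, compared with the first-order standard error of an efficient estimator of the optimal value.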