Fithian William, Hastie Trevor
Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305-4065, USA.
Ann Stat. 2014 Oct 1;42(5):1693-1724. doi: 10.1214/14-AOS1220.
For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1 + 1/c if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly (1 + c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
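The accept-reject scheme and the post-hoc adjustment described above can be illustrated with a minimal sketch. The abstract does not spell out the acceptance probability or the form of the correction; the code below assumes that a point (x, y) is accepted with probability |y − p̃(x)|, where p̃(x) is the pilot model's predicted probability, and that the correction consists of adding the pilot coefficients back to the subsample fit. Under a correctly specified logistic model this choice makes the log-odds within the subsample equal to (θ − θ̃)ᵀx, which is why adding θ̃ would recover θ. All function names (`fit_logistic`, `local_case_control`, `simulate`) and the simulation settings are hypothetical and chosen for illustration only.

```python
import numpy as np


def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Logistic regression by Newton's method (X must include an intercept column).

    A tiny ridge term is added to the Hessian only, for numerical stability;
    the gradient is unpenalized, so the fixed point is still the MLE.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta


def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept-reject, fit on the subsample, adjust.

    Assumption (not stated in the abstract): acceptance probability is
    |y - p_pilot(x)|, and the adjustment adds the pilot coefficients back.
    """
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)          # large when y is "surprising" given x
    keep = rng.random(len(y)) < accept_prob    # independent accept-reject per point
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep


def simulate(n, theta, rng):
    """Hypothetical data generator: Gaussian features, logistic responses."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(theta) - 1))])
    y = (rng.random(n) < sigmoid(X @ theta)).astype(float)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = np.array([-5.0, 1.0, -0.5])   # strongly imbalanced classes
    X, y = simulate(200_000, theta_true, rng)
    # Independent pilot estimate from a separate, smaller sample.
    X_pilot, y_pilot = simulate(20_000, theta_true, rng)
    theta_pilot = fit_logistic(X_pilot, y_pilot)
    theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
    print("fraction of data kept:", keep.mean())
    print("estimate:", theta_hat)
```

In this sketch the pilot is fit on an independent sample to match the abstract's "consistent, independent pilot estimate" condition. The accept-reject step touches each point once and independently of the others, so it could in principle be split across workers.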