Sun Qiang, Zhou Wen-Xin, Fan Jianqing
Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada.
Department of Mathematics, University of California, San Diego, La Jolla, CA 92093.
J Am Stat Assoc. 2020;115(529):254-265. doi: 10.1080/01621459.2018.1543124. Epub 2019 Apr 22.
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1 + δ)-th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1, and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive.
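As a rough illustration of the idea that the robustification parameter should grow with the sample size rather than stay fixed, the following is a minimal sketch and not the authors' implementation. It fits a Huber regression by iteratively reweighted least squares using only NumPy. The function names `huber_regression` and `adaptive_tau`, the constant `c`, the MAD-based scale estimate, and the scaling rule `sqrt(n / (d + log n))` are all illustrative assumptions; the abstract states only that the parameter should adapt to the sample size, dimension and moments.

```python
import numpy as np


def huber_regression(X, y, tau, n_iter=100, tol=1e-8):
    """Minimize the Huber loss with threshold tau by iteratively
    reweighted least squares (a standard solver, assumed here)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS warm start
    for _ in range(n_iter):
        r = y - X @ beta
        # Huber weights: 1 for |r| <= tau, tau / |r| otherwise.
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]
        # Solve the weighted normal equations X' W X beta = X' W y.
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.linalg.norm(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    return beta


def adaptive_tau(X, y, c=1.0):
    """Hypothetical rule: tau grows like sqrt(n / (d + log n)) times a
    robust residual scale. The rate and the constant c are assumptions."""
    n, d = X.shape
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    # Median absolute deviation, rescaled to match a Gaussian std.
    sigma_hat = 1.4826 * np.median(np.abs(r - np.median(r)))
    return c * sigma_hat * np.sqrt(n / (d + np.log(n)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    beta_true = np.array([1.0, -2.0, 0.5])
    # Heavy-tailed noise: Student t with 2.5 degrees of freedom.
    y = X @ beta_true + rng.standard_t(df=2.5, size=n)
    tau = adaptive_tau(X, y)
    print("tau:", tau)
    print("Huber estimate:", huber_regression(X, y, tau))
```

With a very large `tau` every weight equals 1 and the fit reduces to ordinary least squares, while a small `tau` caps the influence of large residuals. A rule of the kind sketched in `adaptive_tau` sits between the two extremes and loosens the cap as more data become available.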