Wang Fan, Mukherjee Sach, Richardson Sylvia, Hill Steven M
1 MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.
2 German Centre for Neurodegenerative Diseases (DZNE), Bonn, Germany.
Stat Comput. 2020;30(3):697-719. doi: 10.1007/s11222-019-09914-9. Epub 2019 Dec 19.
Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a "no panacea" view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
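The kind of comparison described in the abstract can be illustrated with a minimal sketch of a single synthetic scenario. This is not the authors' code or protocol. It assumes independent Gaussian covariates, a fixed penalty `lam = 0.1` rather than a tuned one, and a plain coordinate-descent solver. The solver minimizes (1/2n)||y - Xb||^2 + lam * (l1_ratio * ||b||_1 + (1 - l1_ratio)/2 * ||b||_2^2), so that `l1_ratio` of 1, 0.5 and 0 gives Lasso, Elastic Net and Ridge Regression respectively. All function names (`simulate`, `elastic_net_cd`, `evaluate`) and parameter values are hypothetical choices made for illustration. The three reported metrics loosely mirror the three goals: test error for prediction, true/false positives for variable selection, and hits among the top-ranked coefficients for variable ranking.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam, l1_ratio, n_iter=200):
    """Cyclic coordinate descent for the elastic-net objective.

    l1_ratio=1.0 is the Lasso, l1_ratio=0.0 is Ridge Regression.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual (beta_j left out).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * l1_ratio) / (col_sq[j] + lam * (1.0 - l1_ratio))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def simulate(n, p, s, signal, rng):
    """Draw n samples with p independent N(0, 1) covariates.

    The first s coefficients equal `signal`; the rest are zero (sparsity).
    """
    beta = [signal] * s + [0.0] * (p - s)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y, beta


def evaluate(beta_hat, beta_true, X_test, y_test):
    """Return prediction, selection and ranking metrics for one fitted model."""
    preds = [sum(x * b for x, b in zip(row, beta_hat)) for row in X_test]
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_test, preds)) / len(y_test)
    truth = {j for j, b in enumerate(beta_true) if b != 0.0}
    selected = {j for j, b in enumerate(beta_hat) if b != 0.0}
    # Ranking: how many true signals are among the |truth| largest |coefficients|.
    order = sorted(range(len(beta_hat)), key=lambda j: -abs(beta_hat[j]))
    top_hits = len(set(order[: len(truth)]) & truth)
    return {
        "test_mse": mse,
        "true_pos": len(selected & truth),
        "false_pos": len(selected - truth),
        "top_rank_hits": top_hits,
    }


rng = random.Random(0)
n, p, s = 60, 30, 5  # sample size, dimensionality, number of true signals
X, y, beta_true = simulate(n, p, s, 1.0, rng)
X_test, y_test, _ = simulate(n, p, s, 1.0, rng)

methods = {"lasso": 1.0, "elastic_net": 0.5, "ridge": 0.0}
results = {
    name: evaluate(elastic_net_cd(X, y, lam=0.1, l1_ratio=ratio), beta_true, X_test, y_test)
    for name, ratio in methods.items()
}
for name, res in results.items():
    print(name, res)
```

A study of the kind the abstract describes would repeat this over many scenarios, varying `n`, `p`, `s`, the signal strength and the correlation among covariates, and would tune the penalty for each method instead of fixing it.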