Scott David W, Wang Zhipeng
Department of Statistics, Rice University, MS-138, 6100 Main Street, Houston, TX 77005, USA.
Apple Corporation, Cupertino, CA 95014, USA.
Entropy (Basel). 2021 Jan 9;23(1):88. doi: 10.3390/e23010088.
As modern data analysis pushes the boundaries of classical statistics, it is timely to reexamine alternative approaches to dealing with outliers in multiple regression. As sample sizes and the number of predictors increase, interactive methodology becomes less effective. Likewise, with limited understanding of the underlying contamination process, diagnostics are likely to fail as well. In this article, we advocate for a non-likelihood procedure that attempts to quantify the fraction of bad data as part of the estimation step. These ideas also allow for the selection of important predictors under some assumptions. As there are many robust algorithms available, running several and looking for interesting differences is a sensible strategy for understanding the nature of the outliers.
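The strategy mentioned at the end of the abstract, running more than one fit and looking at where they disagree, can be illustrated with a minimal sketch. This is not the authors' non-likelihood procedure. It assumes a single predictor, synthetic data with 15% planted contamination, Theil-Sen as a stand-in robust estimator, and a MAD-based residual cutoff as a crude proxy for the fraction of bad data. All names and constants (`ols_fit`, `theil_sen_fit`, `flagged_fraction`, the cutoff of 3) are hypothetical choices made for illustration.

```python
import random
import statistics


def ols_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def theil_sen_fit(x, y):
    """Theil-Sen: median of pairwise slopes, then median intercept."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    b = statistics.median(slopes)
    a = statistics.median(yi - b * xi for xi, yi in zip(x, y))
    return a, b


def flagged_fraction(x, y, a, b, cutoff=3.0):
    """Share of points whose residual exceeds cutoff robust-scale units.

    The scale is 1.4826 * MAD, which matches the standard deviation
    for Gaussian residuals. The cutoff is an assumed, illustrative value.
    """
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    med = statistics.median(resid)
    mad = statistics.median(abs(r - med) for r in resid)
    scale = 1.4826 * mad
    return sum(abs(r - med) > cutoff * scale for r in resid) / len(resid)


# Synthetic data (assumption): 85 clean points on y = 1 + 2x plus noise,
# and 15 high-leverage bad points near x in [9, 10] with y near 0.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(85)]
y = [1 + 2 * xi + rng.gauss(0, 0.5) for xi in x]
x += [rng.uniform(9, 10) for _ in range(15)]
y += [rng.gauss(0, 0.5) for _ in range(15)]

a_ols, b_ols = ols_fit(x, y)
a_ts, b_ts = theil_sen_fit(x, y)
frac = flagged_fraction(x, y, a_ts, b_ts)

print(f"OLS slope:        {b_ols:.3f}")
print(f"Theil-Sen slope:  {b_ts:.3f}")
print(f"Flagged fraction: {frac:.2f}")
```

A large gap between the two slopes is the kind of "interesting difference" that suggests contamination, and the flagged fraction gives a rough after-the-fact count of suspect points. The procedure advocated in the article differs in that it estimates the fraction of bad data within the estimation step itself rather than from residuals afterwards.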