Ryan A. Peterson, Joseph E. Cavanaugh
Department of Biostatistics, University of Iowa College of Public Health, Iowa City, IA, USA.
Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
J Appl Stat. 2019 Jun 15;47(13-15):2312-2327. doi: 10.1080/02664763.2019.1630372. eCollection 2020.
Normalization transformations have recently experienced a resurgence in popularity in the era of machine learning, particularly in data preprocessing. However, the classical methods that can be adapted to cross-validation are not always effective. We introduce Ordered Quantile (ORQ) normalization, a one-to-one transformation that is designed to consistently and effectively transform a vector of arbitrary distribution into a vector that follows a normal (Gaussian) distribution. In the absence of ties, ORQ normalization is guaranteed to produce normally distributed transformed data. Once trained, an ORQ transformation can be readily and effectively applied to new data. We compare the effectiveness of the ORQ technique with other popular normalization methods in a simulation study where the true data generating distributions are known. We find that ORQ normalization is the only method that works consistently and effectively, regardless of the underlying distribution. We also explore the use of repeated cross-validation to identify the best normalizing transformation when the true underlying distribution is unknown. We apply our technique and other normalization methods via the bestNormalize R package on a car pricing data set. We built bestNormalize to evaluate the normalization efficacy of many candidate transformations; the package is freely available via the Comprehensive R Archive Network.
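The abstract describes ORQ normalization only in words, so a minimal sketch may help make the idea concrete. It assumes the rank-based form g(x) = Φ⁻¹((rank(x) − 1/2) / n), which the abstract does not state, and it is not the bestNormalize API: the names `fit_orq` and `apply_orq` are hypothetical. New values inside the training range are handled by linear interpolation, and values outside it are simply clamped, which is a simplification chosen here for brevity.

```python
from bisect import bisect_left
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its inverse CDF


def fit_orq(x):
    """Fit a rank-based normalizing map on training values (hypothetical helper)."""
    n = len(x)
    xs = sorted(x)
    # Assumed form: the i-th smallest value maps to Phi^{-1}((i - 0.5) / n).
    zs = [_STD_NORMAL.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return xs, zs


def apply_orq(fit, value):
    """Transform one value with a fitted map (hypothetical helper)."""
    xs, zs = fit
    # Simplification: clamp anything outside the training range.
    if value <= xs[0]:
        return zs[0]
    if value >= xs[-1]:
        return zs[-1]
    j = bisect_left(xs, value)  # now xs[j - 1] < value <= xs[j]
    if xs[j] == value:
        return zs[j]
    # Linear interpolation between the two neighbouring training points.
    w = (value - xs[j - 1]) / (xs[j] - xs[j - 1])
    return zs[j - 1] + w * (zs[j] - zs[j - 1])


# Made-up, right-skewed training sample with no ties.
train = [0.2, 1.5, 0.7, 12.0, 3.1, 0.05, 5.4, 2.2, 0.9]
fit = fit_orq(train)
transformed = [apply_orq(fit, v) for v in train]
new_value = apply_orq(fit, 1.0)  # an unseen value inside the training range
```

Because the map depends only on ranks, the transformed training values are the same normal quantiles whatever the shape of the input, which is one way to read the claim that the method works regardless of the underlying distribution when there are no ties.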