Dazard Jean-Eudes, Rao J Sunil
Division of Bioinformatics, Center for Proteomics and Bioinformatics, Case Western Reserve University. Cleveland, OH 44106, USA.
Comput Stat Data Anal. 2012 Jul 1;56(7):2317-2333. doi: 10.1016/j.csda.2012.01.012.
This paper addresses a common problem in the analysis of high-dimensional, high-throughput "omics" data: estimating parameters across many variables when the number of variables is much larger than the sample size. With such data, variable-specific variance estimators are unreliable and variable-wise test statistics have low power, both owing to a lack of degrees of freedom. In addition, the variance in this type of data has been observed to increase as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in two respects: (i) it employs a novel "similarity statistic"-based clustering technique to generate local-pooled, or regularized, shrinkage estimators of population parameters; and (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that the usual sample variance estimator is improved by a shrinkage estimator that uses information contained in the sample mean. From these joint regularized shrinkage estimators, we derive regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, than regular common value-shrinkage estimators, or than estimators that simply ignore the information contained in the sample mean. Finally, we show that these estimators have useful variance-stabilization and normalization properties that can be exploited for preprocessing high-dimensional multivariate data. The method is available as an R package called 'MVR' ('Mean-Variance Regularization'), downloadable from the CRAN website.
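To make the idea of joint, cluster-based pooling of means and variances concrete, the following is a minimal Python sketch. It is not the MVR package's algorithm: plain k-means with a fixed number of clusters stands in for the paper's "similarity statistic"-based clustering, the t-like statistic is a simplified one-sample version, and the function names `pooled_moments` and `regularized_t` and all parameters are hypothetical.

```python
import numpy as np

def pooled_moments(X, n_clusters=3, n_iter=50, seed=0):
    """Cluster variables on (mean, log sd) and pool moments within clusters.

    X has shape (n_samples, n_variables). Assumption: fixed-k k-means is used
    here in place of the paper's similarity-statistic-based clustering.
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)

    # Cluster jointly on both moments; standardize so neither dominates.
    feats = np.column_stack([mu, np.log(sd + 1e-12)])
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)

    # Plain Lloyd iterations, initialized from randomly chosen variables.
    centers = feats[rng.choice(len(feats), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = labels == k
            if members.any():
                centers[k] = feats[members].mean(axis=0)

    # Replace each variable's moments by its cluster's pooled moments.
    pooled_mu = np.empty_like(mu)
    pooled_sd = np.empty_like(sd)
    for k in range(n_clusters):
        members = labels == k
        if members.any():
            pooled_mu[members] = mu[members].mean()
            pooled_sd[members] = np.sqrt((sd[members] ** 2).mean())
    return labels, pooled_mu, pooled_sd

def regularized_t(X, pooled_sd):
    """Simplified one-sample t-like statistic using the pooled sd."""
    n = X.shape[0]
    return np.sqrt(n) * X.mean(axis=0) / pooled_sd
```

With few samples per variable, each per-variable standard deviation rests on very few degrees of freedom, whereas the pooled value borrows strength from every variable in the same cluster. Dividing centered data by `pooled_sd` gives a rough analogue of the variance-stabilizing use mentioned in the abstract.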