Allen Genevera I, Tibshirani Robert
Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, & Department of Statistics, Rice University, Houston, TX, 77005.
Departments of Health Research & Policy and Statistics, Stanford University, Stanford, CA, 94305.
J R Stat Soc Series B Stat Methodol. 2012 Sep;74(4):721-743. doi: 10.1111/j.1467-9868.2011.01027.x. Epub 2012 Mar 16.
We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable, meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent due to latent variables or unknown batch effects. By modeling this matrix data using the matrix-variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: we simultaneously estimate row and column covariances and use these to sphere or de-correlate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance of the false discovery rate estimators.
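The sphering step described above can be illustrated with a minimal sketch. It assumes the row and column covariances are already known, whereas the paper estimates them jointly from the data; that estimation step is not reproduced here. The names `sym_sqrt`, `inv_sqrt`, and `sphere` are hypothetical helpers chosen for illustration and are not taken from the paper.

```python
import numpy as np


def sym_sqrt(cov):
    """Symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return (vecs * np.sqrt(vals)) @ vecs.T


def inv_sqrt(cov):
    """Symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return (vecs / np.sqrt(vals)) @ vecs.T


def sphere(X, row_cov, col_cov, mean=None):
    """De-correlate matrix noise given (assumed known) row and column covariances.

    Under a matrix-variate normal model with row covariance `row_cov` and
    column covariance `col_cov`, the result has identity row and column
    covariance.
    """
    if mean is not None:
        X = X - mean
    return inv_sqrt(row_cov) @ X @ inv_sqrt(col_cov)


# Illustrative covariances (assumed): AR(1)-style correlation on both sides.
n_rows, n_cols = 50, 8
row_idx = np.arange(n_rows)
col_idx = np.arange(n_cols)
row_cov = 0.5 ** np.abs(row_idx[:, None] - row_idx[None, :])
col_cov = 0.7 ** np.abs(col_idx[:, None] - col_idx[None, :])

# Draw independent noise Z, then correlate its rows and columns.
rng = np.random.default_rng(0)
Z = rng.standard_normal((n_rows, n_cols))
X = sym_sqrt(row_cov) @ Z @ sym_sqrt(col_cov)

# Sphering with the true covariances recovers the independent noise.
X_sphered = sphere(X, row_cov, col_cov)
print(np.allclose(X_sphered, Z))
```

In practice the covariances are unknown and must be estimated, so the sphered data are only approximately independent; the sketch shows the idealized case in which the transformation is exact.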