Bradley Efron
Department of Statistics, Stanford University.
J Am Stat Assoc. 2010 Sep 1;105(491):1042-1055. doi: 10.1198/jasa.2010.tm09129.
We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects' expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It might seem that we must estimate an N by N correlation matrix, where N is the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all N · (N - 1)/2 pairs, a quantity that is often easily estimated. A second result shows that z-values closely follow normal distributions even under non-null conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.
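As a rough illustration of why the root mean square correlation can be "easily estimated", the sketch below computes it from an N by n data matrix (N cases such as genes, n subjects) without ever forming the N by N correlation matrix. It relies on the identity that the sum of squared entries of Z Zᵀ equals that of Zᵀ Z, where Z holds the row-standardized data. The function name `rms_correlation`, the input layout, and the optional `debias` step are assumptions made for this example. In particular, subtracting 1/(n - 1), the expected squared sample correlation of two independent normal rows, is only a simple heuristic correction and is not claimed to be the estimator used in the paper.

```python
import math


def rms_correlation(rows, debias=True):
    """Root mean square of the pairwise sample correlations between rows.

    rows: list of N sequences, each with n measurements (one per subject).
    Hypothetical helper for illustration; not the paper's exact estimator.
    """
    N, n = len(rows), len(rows[0])

    # Center each row and scale it to unit length, so the dot product of
    # two standardized rows equals their sample correlation.
    Z = []
    for row in rows:
        mean = sum(row) / n
        centered = [x - mean for x in row]
        norm = math.sqrt(sum(c * c for c in centered))
        if norm == 0.0:
            raise ValueError("a constant row has no defined correlation")
        Z.append([c / norm for c in centered])

    # Sum of squares of the n x n matrix Z^T Z, which equals the sum of
    # squares of all entries of the N x N correlation matrix Z Z^T.
    total_sq = 0.0
    for a in range(n):
        for b in range(n):
            g = sum(z[a] * z[b] for z in Z)
            total_sq += g * g

    # Drop the N diagonal ones and average over the N * (N - 1) ordered
    # pairs (same mean as over the N * (N - 1) / 2 unordered pairs).
    mean_sq = (total_sq - N) / (N * (N - 1))

    if debias:
        # Assumption: heuristic correction for sampling noise, using
        # E[r^2] = 1 / (n - 1) for independent normal rows.
        mean_sq -= 1.0 / (n - 1)

    return math.sqrt(max(mean_sq, 0.0))
```

The work here grows as N · n² rather than N², which matters when N is in the thousands and n is a few dozen subjects. With `debias=False` the function returns the plain root mean square of the sample correlations, which tends to overstate the true value when n is small.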