Grant Gregory R, Liu Junmin, Stoeckert Christian J
Center for Bioinformatics, University of Pennsylvania, 1429 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, USA.
Bioinformatics. 2005 Jun 1;21(11):2684-90. doi: 10.1093/bioinformatics/bti407. Epub 2005 Mar 29.
Searching for differentially expressed genes is one of the most common applications for microarrays, yet statistically there are difficult hurdles to achieving adequate rigor and practicality. False discovery rate (FDR) approaches have become relatively standard; however, how to define and control the FDR has been hotly debated. Permutation estimation approaches such as SAM and PaGE can be effective; however, they leave much room for improvement. We pursue the permutation estimation method and describe a convenient definition for the FDR that can be estimated in a straightforward manner. We then discuss issues regarding the choice of statistic and data transformation. It is impossible to optimize the power of any statistic for thousands of genes simultaneously, and we look at the practical consequences of this. For example, the log transform can both help and hurt at the same time, depending on the gene. We examine issues surrounding the SAM 'fudge factor' parameter, and how to handle these issues by optimizing with respect to power.
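As a rough illustration of the permutation estimation idea mentioned above, the following minimal sketch estimates an FDR for a fixed cutoff on a SAM-style statistic. It is not the authors' procedure, nor the SAM or PaGE implementation. The function names `moderated_t` and `permutation_fdr`, the parameter `s0` (standing in for the 'fudge factor'), the pooled-variance form of the statistic, and the simple ratio estimate (mean number of permuted statistics beyond the cutoff divided by the number of observed statistics beyond it) are all assumptions made for illustration.

```python
import numpy as np

def moderated_t(x, y, s0=0.0):
    """SAM-style statistic per gene: mean difference / (pooled standard error + s0).

    x, y: arrays of shape (genes, samples) for the two conditions.
    s0:   hypothetical 'fudge factor' added to the denominator (assumption).
    """
    nx, ny = x.shape[1], y.shape[1]
    diff = x.mean(axis=1) - y.mean(axis=1)
    pooled_var = ((nx - 1) * x.var(axis=1, ddof=1)
                  + (ny - 1) * y.var(axis=1, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return diff / (se + s0)

def permutation_fdr(x, y, cutoff, s0=0.0, n_perm=200, seed=0):
    """Estimate the FDR for calling genes with |statistic| >= cutoff.

    The expected number of false calls is approximated by the mean count of
    permuted statistics beyond the cutoff (an illustrative estimator only).
    """
    rng = np.random.default_rng(seed)
    data = np.hstack([x, y])
    nx = x.shape[1]
    observed = np.abs(moderated_t(x, y, s0))
    n_called = int((observed >= cutoff).sum())
    null_counts = []
    for _ in range(n_perm):
        # Shuffle sample labels by permuting the columns.
        cols = rng.permutation(data.shape[1])
        px, py = data[:, cols[:nx]], data[:, cols[nx:]]
        null_counts.append(int((np.abs(moderated_t(px, py, s0)) >= cutoff).sum()))
    expected_false = float(np.mean(null_counts))
    fdr = min(1.0, expected_false / n_called) if n_called > 0 else 0.0
    return n_called, fdr

# Simulated example: 1000 genes, 6 samples per condition, 50 genes truly shifted.
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 6))
y = rng.normal(size=(1000, 6))
y[:50] += 3.0
n_called, fdr = permutation_fdr(x, y, cutoff=4.0, s0=0.1)
print(n_called, round(fdr, 3))
```

Raising `s0` shrinks the statistic most for genes with small standard errors, which is one way to see why a single value cannot suit every gene at once.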