Tan Yuande, Liu Yin
Bioinformation. 2011;7(8):400-4. doi: 10.6026/97320630007400. Epub 2011 Dec 21.
Identification of genes differentially expressed across multiple conditions has become an important statistical problem in analyzing large-scale microarray data. Many statistical methods have been developed to address this challenging problem. An extensive comparison among these methods is therefore extremely important for experimental scientists who need to choose a valid method for their data analysis. In this study, we conducted simulation studies to compare six statistical methods in identifying differentially expressed genes: the Bonferroni (B-) procedure, the Benjamini and Hochberg (BH-) procedure, the Local false discovery rate (Localfdr) method, the Optimal Discovery Procedure (ODP), the Ranking Analysis of F-statistics (RAF), and the Significance Analysis of Microarrays (SAM). We demonstrated that the strength of the treatment effect, the sample size, the proportion of differentially expressed genes, and the variance of gene expression significantly affect the performance of the different methods. The simulation results show that ODP exhibits extremely high power in identifying differentially expressed genes, but significantly underestimates the False Discovery Rate (FDR) in all data scenarios. SAM performs poorly when the sample size is small, but is among the best-performing methods when the sample size is large. The B-procedure is stringent and thus has low power in all data scenarios. Localfdr and RAF show statistical behavior comparable to that of the BH-procedure, with favorable power and conservative FDR estimation. RAF performs best when the proportion of differentially expressed genes is small and the treatment effect is weak, but Localfdr is better than RAF when the proportion of differentially expressed genes is large.
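To make the contrast between the stringent B-procedure and the BH-procedure concrete, the following is a minimal sketch of the two textbook rules applied to a vector of per-gene p-values, together with a helper that scores a rejection set against known truth, as one would in a simulation. It is an illustration only, not the simulation code used in the study; the function names `bonferroni_reject`, `bh_reject`, and `power_and_fdp` are hypothetical, and the p-values are assumed to come from some per-gene test such as an F-test.

```python
import numpy as np


def bonferroni_reject(pvalues, alpha=0.05):
    """Bonferroni rule: reject gene i if p_i <= alpha / m (m = number of genes)."""
    p = np.asarray(pvalues, dtype=float)
    return p <= alpha / p.size


def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up rule.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k genes with the smallest p-values.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                         # gene indices, smallest p first
    thresholds = alpha * np.arange(1, m + 1) / m  # k * alpha / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank passing (0-based)
        reject[order[: k + 1]] = True
    return reject


def power_and_fdp(reject, is_de):
    """Score a rejection set against the simulated truth.

    power = true positives / number of truly differentially expressed genes
    fdp   = false positives / number of rejections (0 if nothing is rejected)
    """
    reject = np.asarray(reject, dtype=bool)
    is_de = np.asarray(is_de, dtype=bool)
    true_pos = np.sum(reject & is_de)
    false_pos = np.sum(reject & ~is_de)
    power = true_pos / max(is_de.sum(), 1)
    fdp = false_pos / max(reject.sum(), 1)
    return power, fdp


if __name__ == "__main__":
    # Hypothetical p-values for ten genes (illustrative numbers only).
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni rejections:", int(bonferroni_reject(pvals).sum()))
    print("BH rejections:", int(bh_reject(pvals).sum()))
```

Averaging the false discovery proportion returned by `power_and_fdp` over many simulated data sets gives an empirical FDR, which is the kind of quantity a simulation comparison of this sort tracks alongside power.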