Zhao Yanli, Pan Wei
Division of Biostatistics, School of Public Health, University of Minnesota, MMC 303, A460 Mayo Building, 420 Delaware Street SE, Minneapolis, MN 55455, USA.
Bioinformatics. 2003 Jun 12;19(9):1046-54. doi: 10.1093/bioinformatics/btf879.
An important goal in analyzing microarray data is to determine which genes are differentially expressed between two kinds of tissue samples, or between samples obtained under two experimental conditions. Various parametric tests, such as the two-sample t-test, have been used, but their parametric assumptions may be too strong, and their large-sample justifications may not hold in practice. As alternatives, three nonparametric statistical methods have been proposed: the empirical Bayes method of Efron et al. (2001), the significance analysis of microarrays (SAM) method of Tusher et al. (2001), and the mixture model method (MMM) of Pan et al. (2001). All three methods depend on constructing a test statistic and a so-called null statistic such that the distribution of the null statistic can be used to approximate the null distribution of the test statistic. However, relatively little effort has been directed toward assessing the performance of these methods, or the assumptions underlying them, in constructing such test and null statistics.
We point out a problem with a current method of constructing the test and null statistics, which may lead to greatly inflated Type I error rates (i.e. false positives). We also propose two modifications that overcome the problem. In the context of MMM, we demonstrate the improved performance of the modified methods using simulated data. Our numerical results also provide evidence supporting the utility and effectiveness of MMM.