Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
Bioinformatics. 2011 Feb 15;27(4):509-15. doi: 10.1093/bioinformatics/btq701. Epub 2010 Dec 24.
It is well known that patterns of differential gene expression across biological conditions are often shared by many genes, particularly those within functional groups. Taking advantage of these patterns can lead to increased statistical power and biological clarity when testing for differential expression in a microarray experiment. The optimal discovery procedure (ODP), which maximizes the expected number of true positives for each fixed number of expected false positives, is a framework aimed at this goal. Storey et al. introduced an estimator of the ODP for identifying differentially expressed genes. However, the computational time of their ODP estimator grows quadratically with the number of genes. Reducing this computational burden is a key step in making the ODP practical for use in a variety of high-throughput problems.
Here, we propose a new estimate of the ODP called the modular ODP (mODP). The existing 'full ODP' requires that the likelihood function for each gene be evaluated according to the parameter estimates for all genes. The mODP assigns genes to modules according to a Kullback-Leibler distance, and then evaluates the statistic only at the module-averaged parameter estimates. We show that the mODP is relatively insensitive to the choice of the number of modules, but dramatically reduces the computational complexity from quadratic to linear in the number of genes. We compare the full ODP algorithm and mODP on simulated data and gene expression data from a recent study of Moroccan Amazighs. The mODP and full ODP algorithm perform very similarly across a range of comparisons.
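The modular idea can be illustrated with a minimal Python sketch. It is not the EDGE implementation, and it rests on several assumptions that the abstract does not state: a two-group design, a normal model with gene-specific means and variance, a k-means-style loop for the Kullback-Leibler module assignment, and no null-gene weighting in the denominator. All function names are hypothetical.

```python
import numpy as np

def fit_genes(x, groups):
    """Gene-wise normal fits for x (genes x samples) and 0/1 group labels.
    Returns per-sample means and variances under the alternative and the null."""
    mu_alt = np.empty_like(x, dtype=float)
    for g in np.unique(groups):
        cols = groups == g
        mu_alt[:, cols] = x[:, cols].mean(axis=1, keepdims=True)
    mu_null = np.repeat(x.mean(axis=1, keepdims=True), x.shape[1], axis=1)
    var_alt = ((x - mu_alt) ** 2).mean(axis=1)
    var_null = ((x - mu_null) ** 2).mean(axis=1)
    return mu_alt, var_alt, mu_null, var_null

def normal_loglik(x, mu, var):
    """Log-likelihood of every gene's data under every parameter set: (m, K)."""
    n = x.shape[1]
    ss = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * n * np.log(2 * np.pi * var)[None, :] - ss / (2 * var)[None, :]

def kl_to_modules(mu, var, mu_c, var_c):
    """KL divergence from each gene's fitted normal to each module's normal: (m, K)."""
    n = mu.shape[1]
    ratio = var[:, None] / var_c[None, :]
    sq = ((mu[:, None, :] - mu_c[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * n * (ratio - 1.0 - np.log(ratio)) + sq / (2.0 * var_c[None, :])

def assign_modules(mu, var, k, n_iter=20, seed=0):
    """Assumed k-means-style assignment: nearest module in KL, then re-average."""
    rng = np.random.default_rng(seed)
    start = rng.choice(mu.shape[0], size=k, replace=False)
    mu_c, var_c = mu[start].copy(), var[start].copy()
    for _ in range(n_iter):
        labels = kl_to_modules(mu, var, mu_c, var_c).argmin(axis=1)
        for c in range(k):
            members = labels == c
            if members.any():  # an empty module keeps its previous parameters
                mu_c[c] = mu[members].mean(axis=0)
                var_c[c] = var[members].mean()
    return labels

def logsumexp_rows(a):
    """Numerically stable log(sum(exp(a))) along each row."""
    mx = a.max(axis=1, keepdims=True)
    return mx[:, 0] + np.log(np.exp(a - mx).sum(axis=1))

def log_odp(x, mu_alt, var_alt, mu_null, var_null, weights):
    """log[ sum_j w_j g_j(x_i) / sum_j w_j f_j(x_i) ] for every gene i."""
    logw = np.log(weights)[None, :]
    num = logsumexp_rows(normal_loglik(x, mu_alt, var_alt) + logw)
    den = logsumexp_rows(normal_loglik(x, mu_null, var_null) + logw)
    return num - den

def full_odp(x, groups):
    """Every gene is evaluated at every gene's estimates: O(m^2) likelihoods."""
    fits = fit_genes(x, groups)
    return log_odp(x, *fits, np.ones(x.shape[0]))

def modular_odp(x, groups, k, labels=None):
    """Every gene is evaluated at K module averages only: O(m * K) likelihoods."""
    mu_alt, var_alt, mu_null, var_null = fit_genes(x, groups)
    if labels is None:
        labels = assign_modules(mu_alt, var_alt, k)
    used = np.unique(labels)  # drop modules that ended up empty
    avg = lambda p: np.array([p[labels == c].mean(axis=0) for c in used])
    sizes = np.array([(labels == c).sum() for c in used], dtype=float)
    return log_odp(x, avg(mu_alt), avg(var_alt), avg(mu_null), avg(var_null), sizes)
```

In this sketch each module's likelihood is weighted by the number of genes it contains, so putting every gene in its own module reproduces the full statistic exactly, while a fixed number of modules K keeps the number of likelihood evaluations proportional to the number of genes.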
The mODP methodology has been implemented in EDGE, a comprehensive gene expression analysis software package in R, available at http://genomine.org/edge/.