Zhu Dongxiao, Hero Alfred O, Qin Zhaohui S, Swaroop Anand
Bioinformatics Program, University of Michigan, Ann Arbor, MI 48109, USA.
J Comput Biol. 2005 Sep;12(7):1029-45. doi: 10.1089/cmb.2005.12.1029.
Many exploratory microarray data analysis tools, such as gene clustering and relevance networks, rely on detecting pairwise gene co-expression. Traditional screening of pairwise co-expression controls either biological significance or statistical significance, but not both. The former approach does not provide stochastic error control, and the latter approach admits many co-expressions with excessively low correlation. We have designed and implemented a statistically sound two-stage co-expression detection algorithm that controls both the statistical significance (false discovery rate, FDR) and the biological significance (minimum acceptable strength, MAS) of the discovered co-expressions. Based on estimates of pairwise gene correlation, the algorithm first performs an initial co-expression discovery that controls only FDR, followed by a second-stage discovery that controls both FDR and MAS. It also computes and thresholds the set of FDR p-values for each correlation that satisfies the MAS criterion. Using simulated data, we validated the asymptotic null distributions of the Pearson and Kendall correlation coefficients and the two-stage error-control procedure; we also compared our two-stage test procedure with another two-stage test procedure using the receiver operating characteristic (ROC) curve. We then used yeast galactose metabolism data to illustrate the advantage of our method for clustering genes and constructing a relevance network. The method has been implemented in an R package, "GeneNT", that is freely available from the Comprehensive R Archive Network (CRAN): www.cran.r-project.org/.
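The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the GeneNT implementation or its API: the function names `pearson` and `two_stage_screen` are hypothetical, and the statistical choices are assumptions made for illustration. Stage 1 uses the Fisher z approximation for the Pearson coefficient with the Benjamini-Hochberg step-up rule. Stage 2 builds confidence intervals at the adjusted level 1 - k*q/m (k discoveries out of m pairs, FDR level q) and keeps only pairs whose interval lies entirely outside [-MAS, MAS]. Dependence among pairwise correlations is ignored here.

```python
import math
from itertools import combinations
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for p-values and quantiles


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_stage_screen(profiles, fdr=0.05, mas=0.5):
    """Hypothetical two-stage screen (illustrative sketch only).

    profiles: dict mapping gene name -> list of n > 3 expression values.
    Returns a list of (gene1, gene2, r, ci_low, ci_high) for pairs that
    pass both the FDR stage and the MAS stage.
    """
    genes = sorted(profiles)
    n = len(profiles[genes[0]])
    se = 1.0 / math.sqrt(n - 3)  # asymptotic std. error of Fisher z

    pairs = []
    for g1, g2 in combinations(genes, 2):
        r = pearson(profiles[g1], profiles[g2])
        r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
        z = math.atanh(r)
        p = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z) / se))  # two-sided, H0: rho = 0
        pairs.append((p, g1, g2, r, z))

    # Stage 1: Benjamini-Hochberg step-up, controls FDR only.
    pairs.sort()
    m = len(pairs)
    k = 0
    for i, (p, *_rest) in enumerate(pairs, start=1):
        if p <= fdr * i / m:
            k = i
    stage1 = pairs[:k]
    if not stage1:
        return []

    # Stage 2: adjusted confidence intervals at level 1 - k*fdr/m;
    # keep pairs whose interval excludes [-mas, mas].
    zcrit = _STD_NORMAL.inv_cdf(1.0 - (k * fdr / m) / 2.0)
    hits = []
    for _p, g1, g2, r, z in stage1:
        lo = math.tanh(z - zcrit * se)
        hi = math.tanh(z + zcrit * se)
        if lo > mas or hi < -mas:
            hits.append((g1, g2, r, lo, hi))
    return hits
```

Calling `two_stage_screen(profiles, fdr=0.05, mas=0.5)` on a dictionary of expression profiles returns only pairs that are both statistically significant and at least as strong as the MAS threshold; raising `mas` shrinks the result without changing the first-stage discoveries.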