School of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA.
Plant Physiol. 2012 Sep;160(1):192-203. doi: 10.1104/pp.112.201962. Epub 2012 Jul 13.
One of the computational challenges in plant systems biology is to accurately infer transcriptional regulation relationships from correlation analyses of gene expression patterns. Although several correlation methods are applied in biology to analyze microarray data, concerns have been raised about whether these methods are compatible with gene expression data profiled by high-throughput RNA transcriptome sequencing (RNA-Seq). These concerns arise mainly because the distribution of read counts in RNA-Seq experiments differs from that of fluorescence intensities in microarray experiments. A comprehensive evaluation of the existing correlation methods and, if necessary, the introduction of novel methods into biology is therefore appropriate. In this study, we compared four existing correlation methods used in microarray analysis and one novel method, the Gini correlation coefficient, on previously published microarray-based and sequencing-based gene expression data in Arabidopsis (Arabidopsis thaliana) and maize (Zea mays). The comparisons were performed on more than 11,000 regulatory relationships in Arabidopsis, including 8,929 pairs of transcription factors and target genes. Our analyses pinpointed the strengths and weaknesses of each method and indicated that the Gini correlation can compensate for the shortcomings of the Pearson, Spearman, Kendall, and Tukey's biweight correlations. The Gini correlation method, together with the other four methods evaluated in this study, was implemented as an R package named rsgcc, which biologists can use as an alternative option for clustering analyses of gene expression patterns or for transcriptional network analyses.
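The abstract names the Gini correlation coefficient without giving its formula. As a rough illustration, the sketch below uses the commonly cited definition, in which the covariance between one variable's values and the other variable's ranks is divided by the covariance between the first variable's values and its own ranks: cov(x, rank(y)) / cov(x, rank(x)). This is an assumption about the general form only; it is not taken from the rsgcc package, and the function names (`average_ranks`, `covariance`, `gini_correlation`) are hypothetical helpers written for this example. Python is used purely for illustration; the package described in the abstract is written in R.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def gini_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Gini correlation of x with respect to y: cov(x, rank(y)) / cov(x, rank(x)).

    The measure is asymmetric, so gini_correlation(x, y) and
    gini_correlation(y, x) can differ.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    denominator = covariance(x, average_ranks(x))
    if denominator == 0:
        raise ValueError("x is constant; the Gini correlation is undefined")
    return covariance(x, average_ranks(y)) / denominator


if __name__ == "__main__":
    # Made-up expression values for two genes across five samples.
    gene_a = [1.0, 2.0, 3.0, 4.0, 100.0]
    gene_b = [2.0, 1.0, 4.0, 3.0, 5.0]
    print(gini_correlation(gene_a, gene_b))
    print(gini_correlation(gene_b, gene_a))
```

Because the measure mixes the values of one variable with the ranks of the other, swapping the arguments generally gives a different number. How the two directions are combined into a single score for a gene pair is a design choice that this sketch leaves open.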