Zhu Dongxiao, Li Youjuan, Li Hua
Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110, USA.
Bioinformatics. 2007 Sep 1;23(17):2298-305. doi: 10.1093/bioinformatics/btm328. Epub 2007 Jun 22.
Estimating pairwise correlation from replicated genome-scale (a.k.a. OMICS) data is fundamental to clustering functionally relevant biomolecules into cellular pathways. The popular Pearson correlation coefficient estimates bivariate correlation by averaging over replicates. This is not completely satisfactory, since the averaging introduces strong bias while reducing variance. We propose a new multivariate correlation estimator that models all replicates as independent and identically distributed (i.i.d.) samples from a multivariate normal distribution. We derive the estimator by maximizing the likelihood function. For small-sample data, we provide a resampling-based statistical inference procedure; for moderate to large samples, we provide an asymptotic inference procedure based on the Likelihood Ratio Test (LRT). We demonstrate the advantages of the new multivariate correlation estimator over the Pearson bivariate correlation estimator using simulations and real-world data analysis examples.
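The contrast between the two estimators can be illustrated with a minimal Python sketch. It is not the CORREP implementation. It assumes a balanced design (the same number of replicates for both molecules), a single shared mean per molecule, and that the between-molecule correlation is read off as the average of the cross-block entries of the maximum-likelihood correlation matrix of the stacked replicates. The function names and the simulation settings are hypothetical, chosen only for illustration.

```python
import numpy as np


def simulate_replicates(n_conditions, n_reps, rho, noise_sd, rng):
    """Simulate two molecules whose latent profiles have correlation rho.

    Each latent value is observed n_reps times with independent Gaussian
    noise. Returns two arrays of shape (n_conditions, n_reps).
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    latent = rng.multivariate_normal([0.0, 0.0], cov, size=n_conditions)
    x = latent[:, [0]] + noise_sd * rng.standard_normal((n_conditions, n_reps))
    y = latent[:, [1]] + noise_sd * rng.standard_normal((n_conditions, n_reps))
    return x, y


def pearson_of_averages(x, y):
    """Pearson correlation computed after averaging over replicates."""
    return float(np.corrcoef(x.mean(axis=1), y.mean(axis=1))[0, 1])


def multivariate_correlation(x, y):
    """Sketch of a replicate-aware correlation estimate (assumed form).

    All replicates are stacked into one vector per condition, centered
    with one shared mean per molecule, and the covariance is estimated
    by maximum likelihood (divisor n). The returned value is the mean of
    the cross-block entries of the resulting correlation matrix.
    """
    m_x = x.shape[1]
    z = np.hstack([x - x.mean(), y - y.mean()])
    sigma = z.T @ z / z.shape[0]
    sd = np.sqrt(np.diag(sigma))
    corr = sigma / np.outer(sd, sd)
    return float(corr[:m_x, m_x:].mean())


rng = np.random.default_rng(0)
x, y = simulate_replicates(n_conditions=20, n_reps=4, rho=0.6, noise_sd=1.0, rng=rng)
print("Pearson on replicate averages:", pearson_of_averages(x, y))
print("Multivariate sketch:          ", multivariate_correlation(x, y))
```

When the replicates are noise-free copies of each other, the two functions return the same value; they differ only once replicate-level noise is present, which is the setting the abstract addresses.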
The estimator and statistical inference procedures have been implemented in an R package 'CORREP' that is available from CRAN [http://cran.r-project.org] and Bioconductor [http://www.bioconductor.org/].
Supplementary data are available at Bioinformatics online.