Hu Ming, Qin Zhaohui S
Department of Biostatistics, Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan, United States of America.
PLoS One. 2009;4(2):e4495. doi: 10.1371/journal.pone.0004495. Epub 2009 Feb 13.
In microarray gene expression data analysis, it is often of interest to identify genes that share similar expression profiles with a particular gene, such as a key regulatory protein. Multiple studies have used various correlation measures to identify co-expressed genes. While these measures work well for small datasets, the heterogeneity introduced by increased sample size inevitably reduces their sensitivity and specificity. This is because most co-expression relationships do not extend to all experimental conditions. With the rapid increase in the size of microarray datasets, identifying functionally related genes from large and diverse microarray gene expression datasets is a key challenge. We develop a model-based gene expression query algorithm built under the Bayesian model selection framework. It is capable of detecting co-expression profiles under a subset of samples/experimental conditions. In addition, it allows linearly transformed expression patterns to be recognized and is robust against sporadic outliers in the data. Both features are critically important for increasing the power of identifying co-expressed genes in large-scale gene expression datasets. Our simulation studies suggest that this method outperforms existing query tools based on correlation coefficients or mutual information. When we apply this new method to the Escherichia coli microarray compendium data, it identifies a majority of known regulons as well as novel potential target genes of numerous key transcription factors.
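The abstract's central observation, that a co-expression relationship confined to a subset of conditions is diluted when correlation is computed over all conditions, can be illustrated with a small simulation. The sketch below is not the Bayesian model selection algorithm described above; it only contrasts a global Pearson correlation with one restricted to the co-expressed conditions. All names and parameter values (`simulate`, the 200 conditions, the 40 co-expressed conditions, the slope of -2, the noise levels) are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_conditions=200, n_coexpressed=40, seed=0):
    """Simulate a query gene and a target gene (hypothetical setup).

    In the first `n_coexpressed` conditions the target is a linear
    transform of the query (slope -2, intercept 1) plus small noise.
    In the remaining conditions the target is independent noise.
    """
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(n_conditions)]
    target = []
    for i, q in enumerate(query):
        if i < n_coexpressed:
            target.append(-2.0 * q + 1.0 + rng.gauss(0, 0.2))
        else:
            target.append(rng.gauss(0, 1))
    return query, target


n_coexpressed = 40
query, target = simulate(n_coexpressed=n_coexpressed)

# Correlation over every condition: the signal is diluted by the
# conditions in which the two genes are unrelated.
r_all = pearson(query, target)

# Correlation restricted to the conditions where the relationship holds.
r_subset = pearson(query[:n_coexpressed], target[:n_coexpressed])

print(f"all conditions:    r = {r_all:.3f}")
print(f"subset conditions: r = {r_subset:.3f}")
```

In this sketch the co-expressed subset is known in advance. In a real compendium it is not, which is why a query method has to infer the relevant conditions rather than rely on a single global coefficient.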