Wei Peng, Pan Wei
Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building (MMC 303), Minneapolis, MN 55455-0378, USA.
Bioinformatics. 2008 Feb 1;24(3):404-11. doi: 10.1093/bioinformatics/btm612. Epub 2007 Dec 14.
A common task in genomic studies is to identify a subset of genes satisfying certain conditions, such as differentially expressed genes or the regulatory target genes of a transcription factor (TF). This can be formulated as a statistical hypothesis testing problem. Most existing approaches treat the genes as independent and identically distributed a priori, testing each gene independently or testing subsets of the genes one at a time. However, genes are known to work in coordination, as dictated by gene networks. Treating genes equally and independently ignores the important information contained in gene networks, leading to inefficient analysis and reduced power.
We propose incorporating gene network information into the statistical analysis of genomic data. Specifically, rather than treating the genes equally and independently a priori in a standard mixture model, we assume that gene-specific prior probabilities are correlated as induced by a gene network: the genes are allowed to have different prior probabilities, but genes that are neighbors in the network have similar prior probabilities, reflecting their shared biological functions. We applied both approaches, the standard mixture model and the network-based one, to a real ChIP-chip dataset (and to simulated data) to identify the transcriptional target genes of the TF GCN4. The new method was found to be more powerful in discovering the target genes.
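To make the contrast between the two priors concrete, the following is a minimal toy sketch, not the authors' model. The abstract does not give the form of the network-induced prior, so everything specific here is an assumption made for illustration: the two-component normal mixture with N(0, 1) and N(3, 1) components, the base prior of 0.2, and the heuristic that pulls each gene's prior toward the mean posterior of its network neighbors. All function names, gene names, and scores are hypothetical.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(z, prior, alt_mean=3.0):
    """Posterior probability that a gene is a target, given its score and prior."""
    f0 = normal_pdf(z, 0.0, 1.0)       # null component (assumed N(0, 1))
    f1 = normal_pdf(z, alt_mean, 1.0)  # target component (assumed N(alt_mean, 1))
    return prior * f1 / (prior * f1 + (1 - prior) * f0)

def network_priors(scores, neighbors, base_prior=0.2, n_iter=20, weight=0.5):
    """Assumed heuristic: shrink each gene's prior toward its neighbors' mean posterior."""
    priors = {g: base_prior for g in scores}
    for _ in range(n_iter):
        post = {g: posterior(scores[g], priors[g]) for g in scores}
        updated = {}
        for g in scores:
            nb = neighbors.get(g, [])
            if nb:
                nb_mean = sum(post[h] for h in nb) / len(nb)
                updated[g] = (1 - weight) * base_prior + weight * nb_mean
            else:
                updated[g] = base_prior  # isolated genes keep the common prior
        priors = updated
    return priors

# Hypothetical toy data: genes A, B, C form a triangle; D has no neighbors.
scores = {"A": 3.5, "B": 3.0, "C": 1.5, "D": 1.5}
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"], "D": []}

# Standard mixture: every gene shares the same prior probability.
standard = {g: posterior(z, 0.2) for g, z in scores.items()}

# Network-based variant: gene-specific priors that are similar among neighbors.
priors = network_priors(scores, neighbors)
network = {g: posterior(scores[g], priors[g]) for g in scores}

for g in scores:
    print(g, round(standard[g], 3), round(network[g], 3))
```

In this toy setup, genes C and D have the same borderline score, so the standard mixture gives them the same posterior probability. Under the network-based priors, C borrows strength from its high-scoring neighbors and receives a higher posterior than the isolated gene D, which is the kind of power gain the abstract describes.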