Lu Yong, Rosenfeld Roni, Simon Itamar, Nau Gerard J, Bar-Joseph Ziv
Computer Science Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213 USA.
Nucleic Acids Res. 2008 Oct;36(17):e109. doi: 10.1093/nar/gkn434. Epub 2008 Aug 1.
The Gene Ontology (GO) is extensively used to analyze all types of high-throughput experiments. However, researchers still face several challenges when using GO and other functional annotation databases. One problem is the large number of hypotheses tested in each study. In addition, categories often overlap with both their direct parents and descendants and with other, more distant categories in the hierarchical structure. This makes it hard to determine whether the significant categories identified represent different functional outcomes or a redundant view of the same biological processes. To overcome these problems, we developed a generative probabilistic model that identifies a small subset of categories that, together, explain the selected gene set. Our model accommodates noise and errors in both the selected gene set and GO. Using controlled GO data, our method correctly recovered most of the selected categories, leading to dramatic improvements over current methods for GO analysis. When used with microarray expression data and ChIP-chip data from yeast and human, our method correctly identified both general and specific enriched categories that were overlooked by other methods.
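The idea of explaining a selected gene set with a small subset of categories can be illustrated with a toy sketch. This is not the authors' model or inference procedure, which the abstract does not specify. It assumes a simple noise model in which a gene covered by an active category is selected with probability `1 - beta`, any other gene is selected with probability `alpha`, and each active category costs a fixed `penalty`. The function names, parameter values, and toy categories are all hypothetical, and the search is a plain greedy one.

```python
import math


def log_score(active, categories, selected, all_genes,
              alpha=0.05, beta=0.2, penalty=3.0):
    """Penalized log-likelihood of the selected gene set given active categories.

    Assumed noise model (illustrative only):
      - a gene in any active category is selected with probability 1 - beta
      - any other gene is selected with probability alpha
      - each active category subtracts `penalty`, which favors small subsets
    """
    covered = set()
    for name in active:
        covered |= categories[name]
    score = -penalty * len(active)
    for gene in all_genes:
        p_selected = (1.0 - beta) if gene in covered else alpha
        score += math.log(p_selected if gene in selected else 1.0 - p_selected)
    return score


def greedy_explain(categories, selected, all_genes, **kwargs):
    """Greedily add the category that most improves the score; stop when none does."""
    active = set()
    best = log_score(active, categories, selected, all_genes, **kwargs)
    while True:
        candidates = [
            (log_score(active | {name}, categories, selected, all_genes, **kwargs), name)
            for name in sorted(categories) if name not in active
        ]
        if not candidates:
            break
        score, name = max(candidates)
        if score <= best:
            break
        active.add(name)
        best = score
    return active


# Toy data: "child" is nested inside "parent"; "other" is unrelated to both.
all_genes = {f"g{i}" for i in range(20)}
categories = {
    "parent": {f"g{i}" for i in range(10)},
    "child": {f"g{i}" for i in range(5)},
    "other": {"g10", "g11", "g12"},
}
selected = {"g0", "g1", "g2", "g3", "g4", "g10", "g11", "g12"}

result = greedy_explain(categories, selected, all_genes)
print(sorted(result))  # ['child', 'other']
```

In this toy case a per-category enrichment test would likely flag `parent` as well, since half of its genes are selected. The joint score leaves it out because `child` already accounts for those genes, which is the kind of redundancy the abstract describes.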