Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany.
Nucleic Acids Res. 2010 Jun;38(11):3523-32. doi: 10.1093/nar/gkq045. Epub 2010 Feb 19.
The interpretation of data-driven experiments in genomics often involves a search for biological categories that are enriched for the responder genes identified by the experiments. However, knowledge bases such as the Gene Ontology (GO) contain hundreds or thousands of categories with very high overlap between categories. Thus, enrichment analysis performed on one category at a time frequently returns large numbers of correlated categories, leaving the choice of the most relevant ones to the user's interpretation. Here we present model-based gene set analysis (MGSA) that analyzes all categories at once by embedding them in a Bayesian network, in which gene response is modeled as a function of the activation of biological categories. Probabilistic inference is used to identify the active categories. The Bayesian modeling approach naturally takes category overlap into account and avoids the need for multiple testing corrections met in single-category enrichment analysis. On simulated data, MGSA identifies active categories with up to 95% precision at a recall of 20% for moderate settings of noise, leading to a 10-fold precision improvement over single-category statistical enrichment analysis. Application to a gene expression data set in yeast demonstrates that the method provides high-level, summarized views of core biological processes and correctly eliminates confounding associations.
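As an illustration of the idea described in the abstract, the following is a minimal Python sketch of a model in which gene response depends on category activation, with the active categories inferred jointly. It is not the authors' implementation. The abstract does not give the model's parameters or inference procedure, so everything specific here is an assumption made for the example: the prior activation probability `p`, the false-positive rate `alpha`, the false-negative rate `beta`, the function name `marginal_posteriors`, the toy categories, and the use of exhaustive enumeration, which is only feasible for a handful of categories.

```python
from itertools import product


def marginal_posteriors(categories, observed, genes, alpha=0.1, beta=0.25, p=0.1):
    """Posterior probability that each category is active (toy model).

    Assumed model (illustrative, not taken from the abstract):
      * each category is active independently with prior probability p;
      * a gene's hidden state is "on" if any category containing it is active;
      * an "on" gene is missed with probability beta (false negative);
      * an "off" gene is observed with probability alpha (false positive).
    """
    names = sorted(categories)
    totals = {name: 0.0 for name in names}
    evidence = 0.0

    # Enumerate every on/off assignment of the categories (2^n states).
    for states in product([False, True], repeat=len(names)):
        # Prior probability of this assignment.
        weight = 1.0
        for on in states:
            weight *= p if on else (1.0 - p)

        # Genes switched on by at least one active category.
        hidden_on = set()
        for name, on in zip(names, states):
            if on:
                hidden_on |= categories[name]

        # Likelihood of the observed responder set given the hidden states.
        for gene in genes:
            seen = gene in observed
            if gene in hidden_on:
                weight *= (1.0 - beta) if seen else beta
            else:
                weight *= alpha if seen else (1.0 - alpha)

        evidence += weight
        for name, on in zip(names, states):
            if on:
                totals[name] += weight

    return {name: totals[name] / evidence for name in names}


# Hypothetical toy data: "broad" fully contains "specific".
genes = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
categories = {
    "specific": {"g1", "g2", "g3"},
    "broad": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "unrelated": {"g7", "g8"},
}
observed = {"g1", "g2", "g3"}  # responder genes from the experiment

posteriors = marginal_posteriors(categories, observed, genes)
for name, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {prob:.3f}")
```

In this toy setting the responders coincide with the "specific" category, which receives by far the highest posterior. The overlapping "broad" category receives a low one, because activating it would also predict responses in g4 to g6 that were not observed. This is the sense in which a joint model can account for category overlap instead of reporting both categories.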