Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
BMC Bioinformatics. 2010 May 20;11:272. doi: 10.1186/1471-2105-11-272.
BACKGROUND: Large-scale genomic studies often identify large gene lists, for example, genes that share the same expression pattern. These gene lists are generally interpreted by extracting the concepts that are overrepresented in them. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular Gene Ontology (GO). However, annotating genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.
RESULTS: We propose a statistical method that uses the primary literature, i.e., free text, as the source for overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experiments with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results.
CONCLUSIONS: We conclude that the current work will provide biologists with a tool that effectively complements existing tools for overrepresentation analysis of gene lists from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
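The abstract does not spell out the mixture model, so the following is only an illustrative sketch of one common way such a model can be set up, not the Genelist Analyzer implementation. It assumes a two-component unigram mixture: each word in the text associated with a gene set is drawn either from a fixed background distribution (weight `lam`) or from an unknown "theme" distribution, which is estimated with EM. The function name `fit_theme_model`, the parameter `lam`, and the toy counts are all hypothetical.

```python
def fit_theme_model(gene_set_counts, background_probs, lam=0.9, n_iter=50):
    """Estimate a theme word distribution with EM (illustrative sketch).

    Assumed model: p(w) = lam * p_background(w) + (1 - lam) * p_theme(w),
    with p_background and lam held fixed.
    """
    vocab = list(gene_set_counts)
    # Start from a uniform theme distribution over the observed vocabulary.
    theme = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(n_iter):
        # E-step: probability that an occurrence of w came from the theme.
        posterior = {}
        for w in vocab:
            from_theme = (1 - lam) * theme[w]
            from_background = lam * background_probs.get(w, 1e-9)
            posterior[w] = from_theme / (from_theme + from_background)
        # M-step: re-estimate the theme from the theme-attributed counts.
        weighted = {w: gene_set_counts[w] * posterior[w] for w in vocab}
        total = sum(weighted.values())
        theme = {w: weighted[w] / total for w in vocab}
    return theme


# Toy word counts, invented purely for illustration.
background_counts = {"the": 1000, "gene": 500, "expression": 300,
                     "foraging": 5, "dopamine": 5}
bg_total = sum(background_counts.values())
background_probs = {w: c / bg_total for w, c in background_counts.items()}

gene_set_counts = {"the": 100, "gene": 50, "expression": 30,
                   "foraging": 20, "dopamine": 15}

theme = fit_theme_model(gene_set_counts, background_probs)
top_terms = sorted(theme, key=theme.get, reverse=True)[:2]
print(top_terms)
```

In this sketch, words that are frequent everywhere are absorbed by the background component, so the terms ranked highest in `theme` are those that are overrepresented in the gene-set text relative to the background collection.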