Zhong Sheng, Xie Dan
Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States.
Artif Intell Med. 2007 Oct;41(2):105-15. doi: 10.1016/j.artmed.2007.08.002.
Gene Ontology (GO) has become a routine resource for the functional analysis of gene lists. Although a number of tools can identify enriched GO terms in one or two gene lists, two technical challenges remain. The first is how to handle multiple hypothesis testing when the tests are heavily correlated. The second is how to identify GO terms that are enriched in one gene cluster relative to multiple other gene clusters. We provide a statistical procedure that treats both problems rigorously, together with a software tool for applying GO to the analysis of gene clusters.
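To make the second challenge concrete, the following is a minimal sketch of a one-sided hypergeometric enrichment test that compares each cluster against the pool of all clusters. It is a generic textbook test, not the procedure implemented in GoSurfer, and the function names (`hypergeom_upper_tail`, `enrichment_pvalues`) and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_pvalues(clusters, annotated):
    """Test one GO term in every cluster against the union of all clusters.

    clusters:  dict mapping cluster name -> set of gene identifiers
    annotated: set of gene identifiers annotated with the GO term
    Assumption: clusters are disjoint and their union is the background.
    """
    universe = set().union(*clusters.values())
    N = len(universe)
    K = len(annotated & universe)
    return {
        name: hypergeom_upper_tail(len(genes & annotated), len(genes), K, N)
        for name, genes in clusters.items()
    }


# Toy input (hypothetical gene names): three annotated genes, all in "up".
clusters = {
    "up": {"g1", "g2", "g3", "g4"},
    "down": {"g5", "g6", "g7", "g8"},
}
pvals = enrichment_pvalues(clusters, annotated={"g1", "g2", "g3"})
```

Because every GO term is tested in every cluster, and because GO terms share genes through the ontology hierarchy, the resulting p-values are correlated and still need a multiple-testing correction.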
We previously introduced a statistical procedure that handles hypothesis testing in a two-group comparison. In this paper we extend that procedure into a general one that supports the analysis of any number of gene lists or clusters. The new procedure identifies GO terms enriched in any gene cluster while controlling for multiple hypothesis testing. It is implemented in a user-friendly analysis tool, GoSurfer. The current version of GoSurfer takes one or several gene lists as input and identifies the GO terms that are enriched in any of them. GoSurfer estimates a conservative false discovery rate (FDR) for every GO term. Its FDR estimation procedure has two advantages: it does not rely on an independence assumption, and it does not assume that all hypotheses are null (the complete null). As a result, GoSurfer's FDR estimates are mildly conservative rather than overly conservative.
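As an illustration of how an FDR can be estimated without assuming independent tests, the sketch below shuffles the cluster labels of the genes and counts how many (cluster, term) pairs pass a p-value threshold by chance. This is a generic label-permutation estimate, not GoSurfer's algorithm: unlike the procedure described above, it does assume the complete null and is therefore likely to be more conservative. All function names and the toy data are hypothetical.

```python
import random
from math import comb


def upper_tail(k, n, K, N):
    """Hypergeometric P(X >= k): n draws from N genes, K of them annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


def all_pvalues(labels, term_genes):
    """labels: gene -> cluster; term_genes: GO term -> set of genes.
    Returns {(cluster, term): enrichment p-value}."""
    N = len(labels)
    sizes = {}
    for cluster in labels.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1
    pvals = {}
    for term, genes in term_genes.items():
        hits, K = {}, 0
        for g in genes:
            if g in labels:  # ignore genes outside the clustered set
                K += 1
                hits[labels[g]] = hits.get(labels[g], 0) + 1
        for cluster, n in sizes.items():
            pvals[(cluster, term)] = upper_tail(hits.get(cluster, 0), n, K, N)
    return pvals


def permutation_fdr(labels, term_genes, threshold, n_perm=200, seed=0):
    """Estimated FDR = mean number of null calls / number of observed calls."""
    rng = random.Random(seed)
    observed = all_pvalues(labels, term_genes)
    n_called = sum(p <= threshold for p in observed.values())
    if n_called == 0:
        return 0.0
    genes = list(labels)
    shuffled_labels = [labels[g] for g in genes]
    null_calls = 0
    for _ in range(n_perm):
        # Shuffling labels keeps cluster sizes and term-term overlap intact,
        # so correlation between tests is preserved in the null.
        rng.shuffle(shuffled_labels)
        null = all_pvalues(dict(zip(genes, shuffled_labels)), term_genes)
        null_calls += sum(p <= threshold for p in null.values())
    return min(1.0, (null_calls / n_perm) / n_called)


# Toy data: 20 genes in two clusters; term "T" annotates 8 genes in "up".
labels = {f"g{i}": ("up" if i < 10 else "down") for i in range(20)}
terms = {"T": {f"g{i}" for i in range(8)}}
fdr = permutation_fdr(labels, terms, threshold=0.001)
```

The design choice here is to permute gene-to-cluster assignments rather than draw p-values independently, which is what lets the null distribution reflect the dependence among GO terms.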
We implemented the new procedure for GO analysis of multiple gene clusters in the GoSurfer software. We provide three examples of using GoSurfer to analyze time course gene expression data sets on the differentiation of embryonic stem cells. In the multiple-cluster example, we first used a typical clustering algorithm to identify five gene clusters, representing up-regulation, down-regulation and other patterns over the differentiation time course. Taking all five gene clusters as input, GoSurfer reports "cell adhesion" and "muscle contraction" as significant GO terms for the up-regulated cluster and "amino acids metabolism" as a significant GO term for the down-regulated cluster. It also reports a number of GO terms related to RNA processing and RNA transport as significant for a cluster that is up-regulated at both early and late time points. This may suggest that genes for RNA processing and genes for RNA transport are coregulated during the differentiation of embryonic stem cells.
The GoSurfer software analyzes multiple gene clusters and identifies the GO terms that are enriched in any of them. GoSurfer is available at www.gosurfer.org.