Barriot Roland, Sherman David J, Dutour Isabelle
CBiB, Bordeaux Bioinformatics Center, Université Victor Segalen Bordeaux 2, 146 Rue Léo Saignat, 33076 Bordeaux, France.
BMC Bioinformatics. 2007 Sep 11;8:332. doi: 10.1186/1471-2105-8-332.
The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations among heterogeneous data such as Gene Ontology (GO) annotations, gene expression data and the genome location of genes. Despite the rapid growth of available data, very little has been proposed in terms of formalization and optimization. Additionally, current methods mostly ignore the structure of the data, which leads to redundant results. For example, when searching for enrichment in GO terms, a gene can be annotated with multiple GO terms, and each annotation should be propagated to the more general terms in the Gene Ontology. Consequently, the gene sets often overlap partially or totally, and the reported enriched GO terms are both numerous and redundant, overwhelming the researcher with non-pertinent information. This situation is not unique: it arises whenever some hierarchical clustering is performed (e.g. based on gene expression profiles), the extreme case being when genes that are neighbors on the chromosomes are considered.
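The propagation step described above can be illustrated with a minimal Python sketch. The toy ontology, the term names (`T_root`, `T_a`, ...), the gene names and the helper functions are all hypothetical and are not taken from the article; the sketch only shows how propagating annotations to more general terms produces gene sets that overlap or coincide.

```python
from collections import defaultdict

# Hypothetical toy ontology (an assumption, not real GO data):
# each term maps to its direct parent terms, forming a DAG.
PARENTS = {
    "T_root": [],
    "T_a": ["T_root"],
    "T_b": ["T_a"],
    "T_c": ["T_a"],
}

# Hypothetical direct annotations: gene -> terms it is annotated with.
DIRECT = {
    "g1": ["T_b"],
    "g2": ["T_b", "T_c"],
    "g3": ["T_c"],
}


def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct, parents):
    """Build term -> gene set, propagating each annotation upwards."""
    gene_sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term, parents):
                gene_sets[t].add(gene)
    return dict(gene_sets)


GENE_SETS = propagate(DIRECT, PARENTS)
```

In this toy example `T_a` and `T_root` end up with exactly the same gene set, and `T_b` and `T_c` share a gene, which is the kind of total or partial overlap that makes reported terms redundant.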
We present a generic framework to efficiently identify the most pertinent over-represented features in a set of genes. We propose a formal representation of gene sets based on the theory of partially ordered sets (posets), and give a formal definition of target set pertinence. Algorithms and compact representations of target sets are provided for the generation and the evaluation of the pertinent target sets. The relevance of our method is illustrated through the search for enriched GO annotations in the proteins involved in a multiprotein complex. The results demonstrate gains in pertinence (up to 64% of redundancy removed), space requirements (up to 73% less storage) and efficiency (up to 98% fewer comparisons).
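As a rough illustration of ordering gene sets as a poset, the Python sketch below orders sets by inclusion, merges labels whose gene sets are identical, and lists the cover relations (the edges of the Hasse diagram). This is an assumption-laden sketch: the abstract does not give the article's pertinence definition or its algorithms, so the function names and the toy data are hypothetical and the code should not be read as the authors' method.

```python
def merge_identical(gene_sets):
    """Group labels that share exactly the same gene set.

    Returns a dict: frozenset of genes -> list of labels.
    """
    merged = {}
    for label, genes in gene_sets.items():
        merged.setdefault(frozenset(genes), []).append(label)
    return merged


def cover_relations(sets):
    """Return pairs (a, b) where a is a proper subset of b and no
    other set in the collection lies strictly between them."""
    sets = list(sets)
    covers = []
    for a in sets:
        for b in sets:
            if a < b and not any(a < c < b for c in sets):
                covers.append((a, b))
    return covers


# Hypothetical gene sets after annotation propagation (toy data).
TOY_SETS = {
    "T_root": {"g1", "g2", "g3"},
    "T_a": {"g1", "g2", "g3"},
    "T_b": {"g1", "g2"},
    "T_c": {"g2", "g3"},
    "T_d": {"g2"},
}

MERGED = merge_identical(TOY_SETS)
COVERS = cover_relations(MERGED.keys())
```

Here five labeled sets collapse to four distinct sets, and only direct inclusions are kept as edges, so a set needs to be compared with its immediate neighbors in the ordering rather than with every other set. The quadratic scan in `cover_relations` is chosen for readability, not efficiency.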
The generic framework presented in this article provides a formal approach to adequately represent available data and efficiently search for pertinent over-represented features in a set of genes or proteins. The formalism and the pertinence definition can be directly used by most of the methods and tools currently available for feature enrichment analysis.