Université de Lorraine, CNRS, Inria, LORIA, Nancy, F-54500, France.
BMC Bioinformatics. 2018 Nov 20;19(Suppl 14):413. doi: 10.1186/s12859-018-2380-2.
Families of related proteins and their different functions can be described systematically using common classifications and ontologies such as Pfam and GO (Gene Ontology). However, many proteins consist of multiple domains, and each domain, or some combination of domains, can be responsible for a particular molecular function. Identifying which domains should be associated with a specific function is therefore a non-trivial task.
We describe a general approach for the computational discovery of associations between different sets of annotations by formalising the problem as a bipartite graph enrichment problem in the setting of a tripartite graph. We call this approach "CODAC" (for COmputational Discovery of Direct Associations using Common Neighbours). As one application of this approach, we describe "GODomainMiner" for associating GO terms with protein domains. We used GODomainMiner to predict GO-domain associations between each of the 3 GO ontology namespaces (MF, BP, and CC) and the Pfam, CATH, and SCOP domain classifications. Overall, GODomainMiner yields average enrichments of 15-, 41- and 25-fold in GO-domain associations over the existing GO annotations in these 3 domain classifications, respectively.
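The common-neighbour idea can be illustrated with a minimal sketch. It assumes that proteins are the shared middle layer of the tripartite graph, linking domains on one side and GO terms on the other, and it scores each candidate domain-GO edge by the cosine similarity of the two protein sets. The scoring function, the function name `common_neighbour_scores`, and the toy identifiers (`PF_A`, `GO_x`, and so on) are assumptions made for illustration and are not taken from the abstract; the published method may weight and filter associations differently.

```python
from collections import defaultdict
from math import sqrt


def common_neighbour_scores(protein_domains, protein_go):
    """Score (domain, GO term) pairs by the proteins they share.

    Both arguments map a protein identifier to a set of annotations.
    The cosine score used here is an illustrative assumption.
    """
    # Invert the two protein-centred maps into annotation -> proteins.
    domain_proteins = defaultdict(set)
    go_proteins = defaultdict(set)
    for protein, domains in protein_domains.items():
        for domain in domains:
            domain_proteins[domain].add(protein)
    for protein, terms in protein_go.items():
        for term in terms:
            go_proteins[term].add(protein)

    # A candidate direct edge exists wherever a domain and a GO term
    # have at least one common neighbour (a shared protein).
    scores = {}
    for domain, d_prots in domain_proteins.items():
        for term, t_prots in go_proteins.items():
            shared = len(d_prots & t_prots)
            if shared:
                scores[(domain, term)] = shared / sqrt(len(d_prots) * len(t_prots))
    return scores


# Hypothetical toy data: three proteins, two domains, two GO terms.
protein_domains = {"P1": {"PF_A"}, "P2": {"PF_A", "PF_B"}, "P3": {"PF_B"}}
protein_go = {"P1": {"GO_x"}, "P2": {"GO_x", "GO_y"}, "P3": {"GO_y"}}

scores = common_neighbour_scores(protein_domains, protein_go)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

In this toy example the multi-domain protein `P2` creates weak links between every domain and every GO term, while the single-domain proteins push the scores of the consistent pairs higher, which is the kind of disambiguation the multi-domain problem calls for.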
These associations could be used to annotate many of the protein chains in the Protein Data Bank and protein sequences in UniProt whose domain composition is known but which currently lack GO annotation.
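How such associations might be used for annotation can be sketched as follows. The sketch assumes a simple rule in which a protein inherits the GO terms associated with each of its domains; this union rule, the function name `transfer_annotations`, and all identifiers in the toy data are hypothetical. Because a function may depend on a combination of domains rather than on a single one, a real pipeline would likely need a more careful rule.

```python
def transfer_annotations(domain_to_go, protein_domains):
    """Predict GO terms for proteins from their domain composition.

    domain_to_go maps a domain to its associated GO terms.
    protein_domains maps a protein or chain to its set of domains.
    The union rule is an illustrative assumption.
    """
    predicted = {}
    for protein, domains in protein_domains.items():
        terms = set()
        for domain in domains:
            # Domains without any known association contribute nothing.
            terms |= domain_to_go.get(domain, set())
        predicted[protein] = terms
    return predicted


# Hypothetical associations and unannotated chains.
domain_to_go = {"PF_A": {"GO_x"}, "PF_B": {"GO_y"}}
unannotated = {"chain_1": {"PF_A", "PF_B"}, "chain_2": {"PF_C"}}

predicted = transfer_annotations(domain_to_go, unannotated)
for chain in sorted(predicted):
    print(chain, sorted(predicted[chain]))
```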