IntelliGO: a new vector-based semantic similarity measure including annotation origin.

Affiliation

LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France.

Publication information

BMC Bioinformatics. 2010 Dec 1;11:588. doi: 10.1186/1471-2105-11-588.

DOI:10.1186/1471-2105-11-588

PMID:21122125

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3098105/

Abstract

BACKGROUND

The Gene Ontology (GO) is a well-known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO terms, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes).

RESULTS

We present here a new semantic similarity measure called IntelliGO which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The IntelliGO similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO biological process and molecular function terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the IntelliGO similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the IntelliGO similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures.
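The vector model described above can be illustrated with a minimal sketch. It assumes that each gene is a sparse vector over GO terms, that each coefficient is the product of an evidence-code weight and the term's information content, and that the basis vectors are not orthogonal, so the dot product sums over all term pairs weighted by a term-term similarity. The abstract does not give the formulas, so the function names, the placeholder term labels (`T1`, `T2`, `T3`), the weights and the pairwise similarity values below are all hypothetical and chosen only for illustration.

```python
import math


def gene_vector(annotations, info_content, evidence_weight):
    """Map each annotating term to a coefficient.

    Assumption: coefficient = evidence-code weight * information content.
    If a term is annotated several times, the largest coefficient is kept.
    """
    vec = {}
    for term, code in annotations:
        coeff = evidence_weight[code] * info_content[term]
        vec[term] = max(vec.get(term, 0.0), coeff)
    return vec


def make_term_sim(pair_sim):
    """Return a symmetric term-term similarity lookup (1.0 on the diagonal)."""
    def term_sim(s, t):
        if s == t:
            return 1.0
        return pair_sim.get(frozenset((s, t)), 0.0)
    return term_sim


def generalized_dot(u, v, term_sim):
    """Dot product over a non-orthogonal basis: sum over all term pairs."""
    return sum(a * b * term_sim(s, t)
               for s, a in u.items()
               for t, b in v.items())


def generalized_cosine(u, v, term_sim):
    """Generalized dot product normalized by the generalized norms."""
    denom = math.sqrt(generalized_dot(u, u, term_sim)
                      * generalized_dot(v, v, term_sim))
    return generalized_dot(u, v, term_sim) / denom if denom else 0.0


# Hypothetical toy data (not taken from the paper).
info_content = {"T1": 2.0, "T2": 1.0, "T3": 3.0}
evidence_weight = {"EXP": 1.0, "IEA": 0.5}
term_sim = make_term_sim({frozenset(("T1", "T2")): 0.5})

gene_g = gene_vector([("T1", "EXP"), ("T2", "IEA")], info_content, evidence_weight)
gene_h = gene_vector([("T2", "EXP"), ("T3", "IEA")], info_content, evidence_weight)

sim = generalized_cosine(gene_g, gene_h, term_sim)
print(round(sim, 4))
```

In this sketch, lowering the weight of an evidence code such as `IEA` shrinks the contribution of the corresponding annotations, which is one way the influence of evidence-code weights on the similarity could be checked. For the result to behave like a true cosine (bounded by 1), the term-term similarity matrix should be positive semi-definite; the toy values here satisfy that.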

CONCLUSIONS

The IntelliGO similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering.

AVAILABILITY

An on-line version of the IntelliGO similarity measure is available at: http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/
