Assessment of Semantic Similarity between Proteins Using Information Content and Topological Properties of the Gene Ontology Graph.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2018 May-Jun;15(3):839-849. doi: 10.1109/TCBB.2017.2689762. Epub 2017 Mar 31.

Abstract

The semantic similarity between two interacting proteins can be estimated by combining the similarity scores of the Gene Ontology (GO) terms associated with the proteins. A greater number of similar GO annotations between two proteins indicates greater interaction affinity. Existing semantic similarity measures make use of the GO graph structure, the information content of GO terms, or a combination of both. In this paper, we present a hybrid approach that utilizes both the topological features of the GO graph and the information content of the GO terms. More specifically, we 1) consider a fuzzy clustering of the GO graph based on the level of association of the GO terms, 2) estimate the membership of each GO term in each cluster center based on the respective shortest path lengths, and 3) assign weights to GO term pairs on the basis of their dissimilarity with respect to the cluster centers. We test the performance of our semantic similarity measure against seven other previously published similarity measures using benchmark protein-protein interaction datasets of Homo sapiens and Saccharomyces cerevisiae based on sequence similarity, Pfam similarity, area under ROC curve, and measure.
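
Steps 2) and 3) of the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the toy graph and its term names are hypothetical, memberships use a simple inverse-distance normalization over shortest path lengths, and the pair weight is half the L1 distance between two membership vectors. It should not be read as the authors' method.

```python
from collections import deque

# Hypothetical toy graph standing in for a small GO fragment (undirected view).
EDGES = [("root", "a"), ("root", "b"), ("a", "c"), ("a", "d"), ("b", "e")]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_lengths(adj, source):
    """Breadth-first search: edge-count distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def memberships(adj, centers):
    """Assumed rule: membership of a term in a cluster is proportional to
    1 / (shortest path length to that cluster's center), normalized to sum to 1.
    A center belongs fully to its own cluster. Assumes a connected graph."""
    dists = [shortest_path_lengths(adj, c) for c in centers]
    result = {}
    for term in adj:
        d = [dist[term] for dist in dists]
        if 0 in d:
            result[term] = [1.0 if x == 0 else 0.0 for x in d]
        else:
            inv = [1.0 / x for x in d]
            total = sum(inv)
            result[term] = [x / total for x in inv]
    return result


def pair_weight(u, v):
    """Assumed dissimilarity of two membership vectors: half their L1 distance,
    which lies in [0, 1] when each vector sums to 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(u, v))


adj = build_adjacency(EDGES)
member = memberships(adj, centers=["a", "b"])  # cluster centers chosen by hand
print(member["c"], member["e"], pair_weight(member["c"], member["e"]))
```

In this sketch the cluster centers are picked by hand; in the approach described above they would come from the fuzzy clustering in step 1), and the resulting weights would then be combined with information-content-based term similarities.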
