Center for Biomedicine, European Academy of Bozen/Bolzano (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
Sci Rep. 2017 Mar 23;7(1):381. doi: 10.1038/s41598-017-00465-5.
Protein functional similarity based on gene ontology (GO) annotations is a powerful tool for comparing proteins at the functional level in applications such as protein-protein interaction prediction, gene prioritization, and disease gene discovery. Functional similarity (FS) is usually quantified by combining the GO hierarchy with an annotation corpus that links genes and gene products to GO terms. One large group of algorithms first calculates the GO term semantic similarity (SS) between all the terms annotating the two proteins and then, in a second step described as a "mixing strategy", combines the SS values to yield the final FS value. Because protein annotation varies, for example as a result of annotation bias, this value cannot be reliably compared on an absolute scale. We therefore introduce a similarity z-score that takes into account the FS background distribution of each protein. For a selection of popular SS measures and mixing strategies, we demonstrate a moderate improvement in accuracy when using z-scores in a benchmark that aims to separate orthologous cases from random gene pairs, and we discuss in this context the impact of annotation corpus choice. The approach has been implemented in Frela, a fast, high-throughput public web server for protein FS calculation and interpretation.
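The two-step scheme and the z-score idea described above can be illustrated with a minimal Python sketch. It is not the Frela implementation. The abstract does not give the exact formulas, so several pieces are assumptions made for illustration: best-match average is used as one common mixing strategy, the toy SS function simply returns 1 for identical terms and 0 otherwise, the background of a protein is taken to be its FS against every other protein in a tiny made-up corpus (excluding the partner), and the protein and term names (`P1`, `T1`, ...) are placeholders. How the two per-protein z-scores would be merged into a single value is left open here.

```python
from statistics import mean, pstdev


def best_match_average(ss, terms_a, terms_b):
    """Mixing strategy (assumed here): average of each term's best match."""
    best_a = [max(ss(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(ss(a, b) for a in terms_a) for b in terms_b]
    return (mean(best_a) + mean(best_b)) / 2


def z_score(fs_value, background):
    """Standardize an FS value against a protein's background FS values."""
    sigma = pstdev(background)
    if sigma == 0:
        # Degenerate background: no spread, so no meaningful z-score.
        return 0.0
    return (fs_value - mean(background)) / sigma


def toy_ss(term_a, term_b):
    """Placeholder SS measure: 1 for identical terms, 0 otherwise."""
    return 1.0 if term_a == term_b else 0.0


# Hypothetical annotation corpus: protein -> annotating terms.
annotations = {
    "P1": ["T1", "T2"],
    "P2": ["T2"],
    "P3": ["T3"],
    "P4": ["T2", "T3"],
}


def fs(p, q):
    """Functional similarity of two proteins under the toy SS measure."""
    return best_match_average(toy_ss, annotations[p], annotations[q])


def background(p, exclude):
    """FS of protein p against all other proteins except the partner."""
    return [fs(p, q) for q in annotations if q != p and q != exclude]


fs_p1_p2 = fs("P1", "P2")
z_p1 = z_score(fs_p1_p2, background("P1", "P2"))
z_p2 = z_score(fs_p1_p2, background("P2", "P1"))
print(fs_p1_p2, z_p1, z_p2)
```

In this toy corpus the same raw FS value yields different z-scores for the two proteins, because each protein has its own background distribution. This is the effect the z-score is meant to capture when annotation depth differs between proteins.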