Bioinformatics Centre, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
BMC Bioinformatics. 2010 May 28;11:290. doi: 10.1186/1471-2105-11-290.
Semantic similarity scores for protein pairs are widely applied in functional genomics research to find functional clusters of proteins, to predict protein functions and protein-protein interactions, and to identify putative disease genes. However, because some proteins, such as those related to diseases, tend to be studied more intensively, annotations are likely to be biased, which may affect applications based on semantic similarity measures. It is therefore necessary to evaluate the effects of this bias on semantic similarity scores between proteins and to find a method to mitigate them.
First, we evaluated 14 commonly used semantic similarity scores for protein pairs and demonstrated that they were significantly correlated with the number of annotation terms for the proteins (also known as the protein annotation length). These results suggested that current applications of semantic similarity scores between proteins might be unreliable. Then, to reduce this annotation bias effect, we proposed normalizing the semantic similarity scores between proteins using a power transformation of the scores. We provide evidence that this improves performance in some applications.
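As a rough illustration of the two steps described above, the following Python sketch measures the rank correlation between pair scores and annotation lengths and then applies a generic power transformation. It is not the paper's implementation: the abstract does not state the exponent, the exact form of the normalization, or how an annotation length is assigned to a protein pair, so the function names, the `exponent` parameter, the per-pair `pair_lengths` values, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def power_transform(scores, exponent):
    """Raise each non-negative score to a fixed power (assumed form)."""
    return [s ** exponent for s in scores]


# Made-up toy data: one hypothetical annotation length per protein pair
# and one similarity score per pair. These are not values from the paper.
pair_lengths = [2, 5, 9, 14, 20]
pair_scores = [0.10, 0.25, 0.40, 0.55, 0.80]

bias = spearman(pair_lengths, pair_scores)
normalized = power_transform(pair_scores, exponent=0.5)
```

A power transformation with a positive exponent is monotonic, so it leaves the rank order of the scores, and therefore their Spearman correlation with annotation length, unchanged. Any effect it has appears only in uses that depend on score magnitudes, such as clustering.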
Current semantic similarity measures for protein pairs are highly dependent on protein annotation lengths, which are subject to biological research bias. This affects applications that are based on these semantic similarity scores, especially in clustering studies that rely on score magnitudes. The normalized scores proposed in this paper can reduce the effects of this bias to some extent.