Popescu Mihail, Keller James M, Mitchell Joyce A
Health Management and Informatics Department, University of Missouri, Columbia, MO 65211, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2006 Jul-Sep;3(3):263-74. doi: 10.1109/TCBB.2006.37.
One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these gene products, it is reasonable to include similarity measures based on the terms found in the GO or another taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. For the case in which the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for variations in annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. Initial testing on a group of 194 sequences representing three protein families shows that the FMS and Choquet similarities correlate more strongly with BLAST sequence similarities than do traditional similarity measures such as pairwise average or pairwise maximum.
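The abstract names two baseline measures (pairwise average and pairwise maximum) and the Choquet integral without giving formulas. The sketch below illustrates the textbook definitions only and is not the authors' implementation: the function names, the toy `exact_match` term similarity, and the placeholder term labels such as "T1" are all assumptions introduced here. The discrete Choquet integral sorts the inputs in descending order and weights each drop in value by a fuzzy measure of the set of sources seen so far.

```python
from itertools import product


def exact_match(term_a, term_b):
    """Toy term similarity (assumption): 1.0 for identical terms, else 0.0."""
    return 1.0 if term_a == term_b else 0.0


def pairwise_similarities(terms_a, terms_b, term_sim):
    """Similarity of every term of one product against every term of the other."""
    return [term_sim(a, b) for a, b in product(terms_a, terms_b)]


def pairwise_average(terms_a, terms_b, term_sim):
    """Baseline: mean of all pairwise term similarities (0.0 if no pairs)."""
    sims = pairwise_similarities(terms_a, terms_b, term_sim)
    return sum(sims) / len(sims) if sims else 0.0


def pairwise_maximum(terms_a, terms_b, term_sim):
    """Baseline: largest pairwise term similarity (0.0 if no pairs)."""
    return max(pairwise_similarities(terms_a, terms_b, term_sim), default=0.0)


def choquet_integral(values, measure):
    """Discrete Choquet integral.

    values:  dict mapping each source to a value in [0, 1]
    measure: callable taking a frozenset of sources and returning its
             fuzzy measure; assumed monotone, 0 on the empty set and
             1 on the full set of sources
    """
    # Visit sources from the largest value to the smallest.
    ordered = sorted(values, key=values.get, reverse=True)
    total = 0.0
    for i, source in enumerate(ordered):
        # Value of the next source in the ordering, or 0 after the last one.
        next_value = values[ordered[i + 1]] if i + 1 < len(ordered) else 0.0
        # Weight the drop in value by the measure of the top-(i + 1) sources.
        total += (values[source] - next_value) * measure(frozenset(ordered[: i + 1]))
    return total


def additive_measure(weights):
    """Additive fuzzy measure built from per-source weights that sum to 1."""
    return lambda subset: sum(weights[s] for s in subset)
```

With an additive measure the Choquet integral reduces to a weighted average, a measure that is 1 on every non-empty set gives the maximum, and a measure that is 1 only on the full set gives the minimum. This is why it can express different levels of trust in different sources, which is the general role the abstract assigns it for annotation reliability.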