Yu Haiyuan, Jansen Ronald, Stolovitzky Gustavo, Gerstein Mark
Department of Molecular Biophysics & Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA.
Bioinformatics. 2007 Aug 15;23(16):2163-73. doi: 10.1093/bioinformatics/btm291. Epub 2007 May 31.
Many classifications of protein function, such as Gene Ontology (GO), are organized as directed acyclic graphs (DAGs). In these classifications, the proteins are terminal leaf nodes and the categories 'above' them are functional annotations at various levels of specialization. Computing a numerical measure of relatedness between two arbitrary proteins is an important proteomics problem. Moreover, analogous problems arise in other contexts of large-scale information organization, e.g. the Wikipedia online encyclopedia and the Yahoo and DMOZ web page classification schemes.
Here we develop a simple probabilistic approach for computing this relatedness quantity, which we call the total ancestry method. Our measure is based on counting the number of leaf nodes that share exactly the same set of 'higher up' category nodes in comparison to the total number of classified pairs (i.e. the chance for the same total ancestry). We show that such a measure is associated with a power-law distribution, allowing for the quick assessment of the statistical significance of shared functional annotations. We formally compare it with other quantitative functional similarity measures (such as shortest path within a DAG, lowest common ancestor shared and Azuaje's information-theoretic similarity) and provide concrete metrics to assess differences. Finally, we provide a practical implementation of our total ancestry measure for GO and the MIPS functional catalog and give two applications of it in specific functional genomics contexts.
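As a rough illustration of the counting idea described above, the following minimal Python sketch takes one literal reading of it: for each pair of leaf proteins, collect the category nodes that both have as ancestors, then divide the number of pairs with exactly that shared set by the total number of pairs. The toy DAG, the node names and the functions `ancestors` and `total_ancestry_scores` are all hypothetical and are not taken from the published implementation, which may define the count differently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy DAG (not from the paper): each node maps to its direct parents.
# "root", "A", "B", "C" are category nodes; "p1".."p4" are leaf proteins.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # a DAG node may have more than one parent
    "p1": ["C"],
    "p2": ["C"],
    "p3": ["A"],
    "p4": ["B"],
}
PROTEINS = ["p1", "p2", "p3", "p4"]


def ancestors(node, parents):
    """Return every category node reachable by following parent links upward."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return frozenset(seen)


def total_ancestry_scores(proteins, parents):
    """Assumed reading: score(pair) = (#pairs with the identical shared
    ancestor set) / (total #pairs)."""
    anc = {p: ancestors(p, parents) for p in proteins}
    # Shared ancestry of a pair = intersection of the two ancestor sets.
    shared = {
        (a, b): anc[a] & anc[b] for a, b in combinations(proteins, 2)
    }
    counts = Counter(shared.values())
    n_pairs = len(shared)
    return {pair: counts[s] / n_pairs for pair, s in shared.items()}


if __name__ == "__main__":
    for pair, score in total_ancestry_scores(PROTEINS, PARENTS).items():
        print(pair, round(score, 3))
```

In this toy graph, `p1` and `p2` are the only pair sharing all four category nodes, so their score is 1/6, while the two pairs sharing exactly `{A, root}` each score 2/6. On a real annotation set the per-pair dictionary grows quadratically with the number of proteins, so a practical version would need a more compact representation.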
The implementations and results are available through our supplementary website at: http://gersteinlab.org/proj/funcsim.
Supplementary data are available at Bioinformatics online.