School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA.
PLoS Comput Biol. 2011 Jun;7(6):e1002073. doi: 10.1371/journal.pcbi.1002073. Epub 2011 Jun 9.
A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the "ortholog conjecture"). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.
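The comparison described above can be illustrated with a minimal sketch: score the functional similarity of each gene pair, then average the scores within each homology class. The Jaccard index over annotation sets is an assumption chosen for brevity and is not claimed to be the similarity measure used in the study. The gene names, annotations, and pair labels below are hypothetical toy values that only demonstrate the calculation and carry no biological meaning.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two collections of function annotations."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two unannotated genes share nothing measurable
    return len(a & b) / len(a | b)


def mean_similarity_by_class(pairs, annotations):
    """Average pairwise functional similarity within each homology class.

    pairs: iterable of (gene_a, gene_b, homology_class) tuples
    annotations: dict mapping gene name -> set of function terms
    """
    scores = {}
    for gene_a, gene_b, homology_class in pairs:
        sim = jaccard(annotations[gene_a], annotations[gene_b])
        scores.setdefault(homology_class, []).append(sim)
    return {cls: mean(vals) for cls, vals in scores.items()}


# Hypothetical toy annotations (not real data).
annotations = {
    "human_A1": {"kinase activity", "ATP binding"},
    "human_A2": {"kinase activity", "ATP binding"},
    "mouse_A1": {"kinase activity", "DNA binding"},
    "mouse_A2": {"ATP binding"},
}

# Hypothetical homology labels for each pair (not real data).
pairs = [
    ("human_A1", "mouse_A1", "ortholog"),
    ("human_A1", "human_A2", "within-species paralog"),
    ("human_A1", "mouse_A2", "between-species paralog"),
]

result = mean_similarity_by_class(pairs, annotations)
print(result)
```

In practice the pairs would also be binned by sequence identity before averaging, so that orthologs and paralogs are compared at similar levels of sequence divergence.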