Storm Christian E V, Sonnhammer Erik L L
Center for Genomics and Bioinformatics, Karolinska Institutet, S-171 77 Stockholm, Sweden.
Bioinformatics. 2002 Jan;18(1):92-9. doi: 10.1093/bioinformatics/18.1.92.
Orthologous proteins in different species are likely to have similar biochemical function and biological role. When annotating a newly sequenced genome by sequence homology, the most precise and reliable functional information can thus be derived from orthologs in other species. A standard method of finding orthologs is to compare the sequence tree with the species tree. However, since the topology of a phylogenetic tree is not always reliable, this comparison can produce incorrect orthology assignments.
Here we present a novel method that resolves this problem by analyzing a set of bootstrap trees instead of the optimal tree. The frequency of orthology assignments in the bootstrap trees can be interpreted as a support value for the possible orthology of the sequences. Our method is efficient enough to analyze data on the scale of whole genomes. It is implemented in Java and calculates orthology support levels for all pairwise combinations of homologous sequences of two species. The method was tested on simulated datasets and on real data from homologous proteins.
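The idea of an orthology support value can be illustrated with a minimal Java sketch. It is not the authors' implementation, and it rests on simplifying assumptions that the abstract does not state: each bootstrap tree is a rooted binary gene tree, and an internal node is treated as a speciation when the species sets of its two subtrees do not overlap (otherwise as a duplication). All names (`OrthologySupport`, `Node`, `orthologPairs`, `support`) are hypothetical.

```java
import java.util.*;

class OrthologySupport {
    /** Rooted binary gene tree node; a leaf carries a sequence name and its species. */
    static final class Node {
        final String name, species;
        final Node left, right;
        private Node(String name, String species, Node left, Node right) {
            this.name = name; this.species = species; this.left = left; this.right = right;
        }
        static Node leaf(String name, String species) { return new Node(name, species, null, null); }
        static Node join(Node left, Node right) { return new Node(null, null, left, right); }
        boolean isLeaf() { return left == null; }
    }

    /** Collects all leaves below a node. */
    static List<Node> leaves(Node n) {
        List<Node> out = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(n);
        while (!stack.isEmpty()) {
            Node cur = stack.pop();
            if (cur.isLeaf()) out.add(cur);
            else { stack.push(cur.left); stack.push(cur.right); }
        }
        return out;
    }

    static Set<String> speciesOf(List<Node> leafList) {
        Set<String> s = new HashSet<>();
        for (Node l : leafList) s.add(l.species);
        return s;
    }

    /** Order-independent key for a pair of sequence names. */
    static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    /**
     * Adds every pair whose last common ancestor is a speciation node.
     * Assumption: a node is a speciation iff its subtrees share no species.
     */
    static void orthologPairs(Node n, Set<String> pairs) {
        if (n.isLeaf()) return;
        List<Node> l = leaves(n.left), r = leaves(n.right);
        if (Collections.disjoint(speciesOf(l), speciesOf(r))) {
            for (Node a : l) for (Node b : r) pairs.add(key(a.name, b.name));
        }
        orthologPairs(n.left, pairs);
        orthologPairs(n.right, pairs);
    }

    /** Support value per pair: fraction of bootstrap trees in which the pair is orthologous. */
    static Map<String, Double> support(List<Node> bootstrapTrees) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Node tree : bootstrapTrees) {
            Set<String> pairs = new HashSet<>();
            orthologPairs(tree, pairs);
            for (String p : pairs) counts.merge(p, 1, Integer::sum);
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) bootstrapTrees.size());
        }
        return result;
    }

    public static void main(String[] args) {
        // Two toy bootstrap replicates over sequences a1, a2 (species A) and b1, b2 (species B).
        Node t1 = Node.join(Node.join(Node.leaf("a1", "A"), Node.leaf("b1", "B")),
                            Node.join(Node.leaf("a2", "A"), Node.leaf("b2", "B")));
        Node t2 = Node.join(Node.join(Node.leaf("a1", "A"), Node.leaf("b2", "B")),
                            Node.join(Node.leaf("a2", "A"), Node.leaf("b1", "B")));
        System.out.println(support(Arrays.asList(t1, t2, t1)));
    }
}
```

Pairs that never appear as orthologs in any replicate are simply absent from the returned map, which is equivalent to a support of zero. The sketch recomputes the leaf sets at every node for brevity; an analysis on the scale of whole genomes would need a more economical traversal.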