Huerta-Cepas Jaime, Dopazo Hernán, Dopazo Joaquín, Gabaldón Toni
Bioinformatics Department, Centro de Investigación Príncipe Felipe, Autopista del Saler, 46013 Valencia, Spain.
Genome Biol. 2007;8(6):R109. doi: 10.1186/gb-2007-8-6-r109.
Phylogenomic analyses serve to establish evolutionary relationships among organisms and their genes. A phylome, the complete collection of all gene phylogenies in a genome, is a valuable source of information, but its use in large genomes remains a technical challenge. Using phylomes also requires the development of new methods to help interpret them.
We reconstruct here the human phylome, which includes the evolutionary relationships of all human proteins and their homologs among 39 fully sequenced eukaryotes. The phylogenetic techniques used include alignment trimming, branch length optimization, evolutionary model testing, and maximum likelihood and Bayesian methods. Although differences from alternative topologies are minor, most of the trees support the Coelomata and Unikont hypotheses, as well as the grouping of primates with Laurasiatheria to the exclusion of rodents. We assess the extent of gene duplication events and their relationship with the functional roles of the protein families involved. We find support for at least one, and probably two, rounds of whole genome duplication before the vertebrate radiation. Using a novel algorithm that is independent of a species phylogeny, we derive orthology and paralogy relationships of human proteins among eukaryotic genomes.
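The abstract does not spell out how orthology can be inferred without a species phylogeny. One approach that fits this description is a species-overlap rule: an internal node of a gene tree is treated as a duplication if its two child subtrees share at least one species, and as a speciation otherwise. The sketch below is an illustration under that assumption, not necessarily the exact procedure used here. The nested-tuple tree encoding, the `<species>_<gene>` leaf naming, and the function names are all hypothetical.

```python
from itertools import product


def leaves(node):
    """Return leaf names under a node (leaf = str, internal node = 2-tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def species(leaf):
    """Extract the species code, assuming leaves are named '<species>_<gene>'."""
    return leaf.split("_", 1)[0]


def classify_pairs(node, pairs=None):
    """Label each pair of leaves by the type of node that separates them.

    Assumed rule: if the two child subtrees share a species, the node is
    a duplication and pairs across it are paralogs; otherwise it is a
    speciation and pairs across it are orthologs.
    """
    if pairs is None:
        pairs = {}
    if isinstance(node, str):
        return pairs
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    relation = "paralogs" if shared else "orthologs"
    for a, b in product(left_leaves, right_leaves):
        pairs[frozenset((a, b))] = relation
    classify_pairs(left, pairs)
    classify_pairs(right, pairs)
    return pairs


# Toy gene tree: a duplication in the human/mouse ancestor, fly as outgroup.
toy_tree = ((("Hsa_A1", "Mmu_A1"), ("Hsa_A2", "Mmu_A2")), "Dme_A")
relations = classify_pairs(toy_tree)
```

In this toy tree, `Hsa_A1` and `Mmu_A1` come out as orthologs, `Hsa_A1` and `Mmu_A2` as paralogs, and both human copies as orthologs of the single fly gene. Only the gene tree is consulted, so no reconciliation with a species tree is needed.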
Topological variations among phylogenies for different genes are to be expected, highlighting the danger of gene-sampling effects in phylogenomic analyses. Several links can be established between the functions of gene families duplicated at certain phylogenetic splits and major evolutionary transitions in those lineages. The pipeline implemented here can be easily adapted for use in other organisms.