Department of Computer Science and Operations Research, Université de Montréal, Montréal, Québec, Canada.
PLoS Comput Biol. 2011 Sep;7(9):e1002150. doi: 10.1371/journal.pcbi.1002150. Epub 2011 Sep 15.
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities differ widely between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6-7 introns per kilobase of coding sequence, whereas most other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three supergroups of eukaryotes (of five) for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. The MCMC and ML inferences are in excellent agreement, whereas Dollo parsimony introduces a noticeable bias, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches, including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
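The abstract reports that Dollo parsimony tends to yield lower ancestral intron densities than MCMC and ML. As a rough illustration of how that can happen, the sketch below implements textbook Dollo parsimony for one intron site: the intron may be gained only once, so it is inferred present at exactly those nodes lying on a path between two leaves that carry it. This is not the authors' code. The tree, the node names (`TOY_TREE`, `LECA`, `animal_anc`, `plant_anc`) and the function `dollo_reconstruct` are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set, Tuple

# A node is (name, children); a leaf has an empty child list.
Node = Tuple[str, List["Node"]]

# Hypothetical toy tree, NOT the 99-genome phylogeny used in the paper.
TOY_TREE: Node = ("LECA", [
    ("animal_anc", [("human", []), ("fly", [])]),
    ("plant_anc", [("arabidopsis", []), ("alga", [])]),
    ("yeast", []),
])


def dollo_reconstruct(root: Node, present: Set[str]) -> Dict[str, bool]:
    """Dollo parsimony for one intron site: single gain, any number of losses.

    `present` holds the names of the leaves in which the intron is observed.
    Returns the inferred presence (True/False) at every node of the tree.
    """
    counts: Dict[str, int] = {}

    def count(node: Node) -> int:
        # Number of intron-carrying leaves in the subtree rooted at `node`.
        name, children = node
        c = sum(count(ch) for ch in children) if children else int(name in present)
        counts[name] = c
        return c

    total = count(root)
    states: Dict[str, bool] = {}

    def assign(node: Node) -> None:
        name, children = node
        c = counts[name]
        if not children:
            states[name] = c > 0  # leaves keep their observed state
        else:
            occupied = sum(1 for ch in children if counts[ch[0]] > 0)
            # Present if carriers sit in two or more child subtrees, or if
            # carriers exist both inside and outside this subtree.
            states[name] = c > 0 and (occupied >= 2 or c < total)
        for ch in children:
            assign(ch)

    assign(root)
    return states


shared = dollo_reconstruct(TOY_TREE, {"human", "arabidopsis"})
lone = dollo_reconstruct(TOY_TREE, {"human"})
```

In this toy setting, an intron observed in both `human` and `arabidopsis` is placed at `LECA`, whereas an intron observed only in `human` is placed at no ancestral node at all, even if it was in fact ancient and lost independently in every other lineage. Undercounting of that kind is one plausible source of the downward bias described above; a probabilistic model can instead assign such a site a nonzero probability of ancestral presence.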