Edwards Scott V, Liu Liang, Pearl Dennis K
Department of Organismic and Evolutionary Biology, and Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, USA.
Proc Natl Acad Sci U S A. 2007 Apr 3;104(14):5936-41. doi: 10.1073/pnas.0607004104. Epub 2007 Mar 28.
The vast majority of phylogenetic models focus on resolution of gene trees, despite the fact that phylogenies of species in which gene trees are embedded are of primary interest. We analyze a Bayesian model for estimating species trees that accounts for the stochastic variation expected for gene trees from multiple unlinked loci sampled from a single species history under a coalescent process. Application of the model to a 106-gene data set from yeast shows that the set of gene trees recovered by statistically acknowledging the shared but unknown species tree from which gene trees are sampled is much reduced compared with the set recovered when the history of each locus is treated independently of an overarching species tree. The analysis also yields a concentrated posterior distribution of the yeast species tree whose mode is congruent with the concatenated gene tree, and it does so with fewer than half the loci required by the concatenation method. Using simulations, we show that, with large numbers of loci, highly resolved species trees can be estimated under conditions in which concatenation of sequence data will positively mislead phylogenetic inference, and when the proportion of gene trees matching the species tree is <10%. However, when gene tree/species tree congruence is high, species trees can be resolved with just two or three loci. These results make accessible an alternative paradigm for combining data in phylogenomics that focuses attention on the singularity of species histories and away from the idiosyncrasies and multiplicities of individual gene histories.