Springer Mark S, Gatesy John
Department of Biology, University of California, Riverside, CA 92521, USA.
Mol Phylogenet Evol. 2016 Jan;94(Pt A):1-33. doi: 10.1016/j.ympev.2015.07.018. Epub 2015 Jul 31.
Higher-level relationships among placental mammals are mostly resolved, but several polytomies remain contentious. Song et al. (2012) claimed to have resolved three of these using shortcut coalescence methods (MP-EST, STAR) and further concluded that these methods, which assume no within-locus recombination, are required to unravel deep-level phylogenetic problems that have stymied concatenation. Here, we reanalyze Song et al.'s (2012) data and leverage these re-analyses to explore key issues in systematics including the recombination ratchet, gene tree stoichiometry, the proportion of gene tree incongruence that results from deep coalescence versus other factors, and simulations that compare the performance of coalescence and concatenation methods in species tree estimation. Song et al. (2012) reported an average locus length of 3.1 kb for the 447 protein-coding genes in their phylogenomic dataset, but the true mean length of these loci (start codon to stop codon) is 139.6 kb. Empirical estimates of recombination breakpoints in primates, coupled with consideration of the recombination ratchet, suggest that individual coalescence genes (c-genes) approach ∼12 bp or less for Song et al.'s (2012) dataset, three to four orders of magnitude shorter than the c-genes reported by these authors. This result has general implications for the application of coalescence methods in species tree estimation. We contend that it is illogical to apply coalescence methods to complete protein-coding sequences. Such analyses amalgamate c-genes with different evolutionary histories (i.e., exons separated by >100,000 bp), distort true gene tree stoichiometry that is required for accurate species tree inference, and contradict the central rationale for applying coalescence methods to difficult phylogenetic problems. 
In addition, Song et al.'s (2012) dataset of 447 genes includes 21 loci with switched taxonomic names, eight duplicated loci, 26 loci with non-homologous sequences that are grossly misaligned, and numerous loci with >50% missing data for taxa that are misplaced in their gene trees. These problems were compounded by inadequate tree searches with nearest neighbor interchange branch swapping and inadvertent application of substitution models that did not account for among-site rate heterogeneity. Sixty-six gene trees imply unrealistic deep coalescences that exceed 100 million years (MY). Gene trees that were obtained with better justified models and search parameters show large increases in both likelihood scores and congruence. Coalescence analyses based on a curated set of 413 improved gene trees and a superior coalescence method (ASTRAL) support a Scandentia (treeshrews)+Glires (rabbits, rodents) clade, contradicting one of the three primary systematic conclusions of Song et al. (2012). Robust support for a Perissodactyla+Carnivora clade within Laurasiatheria is also lost, contradicting a second major conclusion of this study. Song et al.'s (2012) MP-EST species tree provided the basis for circular simulations that led these authors to conclude that the multispecies coalescent accounts for 77% of the gene tree conflicts in their dataset, but many internal branches of their MP-EST tree are stunted by an order of magnitude or more due to wholesale gene tree reconstruction errors. An independent assessment of branch lengths suggests the multispecies coalescent accounts for ≤15% of the conflicts among Song et al.'s (2012) 447 gene trees. Unfortunately, Song et al.'s (2012) flawed phylogenomic dataset has been used as a model for additional simulation work that suggests the superiority of shortcut coalescence methods relative to concatenation.
Investigator error was passed on to the subsequent simulation studies, which also incorporated further logical errors that should be avoided in future simulation studies. Illegitimate branch length switches in the simulation routines unfairly protected coalescence methods from their Achilles' heel, high gene tree reconstruction error at short internodes. These simulations therefore provide no evidence that shortcut coalescence methods out-compete concatenation at deep timescales. In summary, the long c-genes that are required for accurate reconstruction of species trees using shortcut coalescence methods do not exist and are a delusion. Coalescence approaches based on SNPs that are widely spaced in the genome avoid problems with the recombination ratchet and merit further pursuit in both empirical systematic research and simulations.