Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.
Mol Phylogenet Evol. 2013 Apr;67(1):277-96. doi: 10.1016/j.ympev.2013.01.020. Epub 2013 Feb 9.
A supermatrix of 272 terminals from Rubiaceae tribe Spermacoceae that were scored for up to 10 gene regions (two nrDNA, eight plastid) was used as an empirical example to quantify sources of error in heuristic parametric (Bayesian MCMC and maximum likelihood) phylogenetic analyses. The supermatrix includes dramatic disparities in which terminals were sampled for which gene regions. The sources of error examined include poor quality tree searches, requiring a single fully resolved optimal tree, undersampling-within-replicates and frequency-within-replicates bootstrap artifacts, and extrapolation from one character partition to another such that synapomorphies that would only be ambiguously optimized by parsimony are optimized with high probability by parametric methods. Four of our conclusions are as follows. (1) The resolution and support provided by parametric methods for clades that lack unambiguously optimized (by parsimony) synapomorphies are less robust to the addition of terminals and characters than those clades that have unambiguously optimized synapomorphies. (2) Those tree-search methods which can create phylogenetic artifacts (frequency-within-replicates resampling, undersampling-within-replicates resampling, requiring a single fully resolved optimal tree, non-independent resampling among replicates) provide the greatest resolution and support irrespective of whether that resolution or support is corroborated by more conservative and better justified methods. (3) Partitioning data matrices cannot be relied upon to consistently obviate potentially dubious resolution and support caused by missing-data artifacts in likelihood analyses when the models require linked branch lengths among partitions. (4) Undersampling-within-replicates and frequency-within-replicates resampling artifacts are not unique to parsimony and should be accounted for in likelihood analyses by allowing multiple equally likely trees to be saved within each resampling pseudoreplicate and applying the strict-consensus bootstrap rather than the frequency-within-replicates bootstrap.
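The contrast drawn in conclusion (4) can be illustrated with a minimal sketch. This is not the authors' code, and it makes several assumptions: each tree is reduced to the set of clades it contains, each pseudoreplicate is a list of equally optimal trees saved for that replicate, and the function names and toy data are hypothetical. Under the strict-consensus bootstrap, a clade is credited in a replicate only if every saved tree contains it. Under the frequency-within-replicates bootstrap, each saved tree contributes a weight of 1/n, so a clade found in only some of the equally optimal trees still gains partial support.

```python
from collections import Counter
from fractions import Fraction


def clade(*terminals):
    """Represent a clade as the set of terminals it contains."""
    return frozenset(terminals)


def strict_consensus_support(replicates):
    """Credit a clade in a replicate only if ALL saved trees contain it."""
    counts = Counter()
    for trees in replicates:
        shared = set.intersection(*(set(tree) for tree in trees))
        counts.update(shared)
    n = len(replicates)
    return {c: Fraction(k, n) for c, k in counts.items()}


def frequency_within_replicates_support(replicates):
    """Credit a clade by the fraction of saved trees in a replicate that contain it."""
    totals = Counter()
    for trees in replicates:
        weight = Fraction(1, len(trees))  # each equally optimal tree gets 1/n
        for tree in trees:
            for c in tree:
                totals[c] += weight
    n = len(replicates)
    return {c: t / n for c, t in totals.items()}


# Hypothetical toy data: three rooted trees on terminals A-E, as clade sets.
T1 = {clade("A", "B"), clade("C", "D"), clade("A", "B", "C", "D")}
T2 = {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")}
T3 = {clade("A", "C"), clade("A", "B", "C"), clade("D", "E")}

# Four pseudoreplicates; two of them saved two equally optimal trees.
replicates = [[T1], [T1, T2], [T2, T3], [T1]]

sc = strict_consensus_support(replicates)
fwr = frequency_within_replicates_support(replicates)

for c in sorted(fwr, key=lambda s: (len(s), sorted(s))):
    print(sorted(c), "SC:", sc.get(c, Fraction(0)), "FWR:", fwr[c])
```

In this toy case the clade (A, C) appears in only one of two equally optimal trees in a single replicate. It receives no support under the strict-consensus rule but nonzero support under the frequency-within-replicates rule, which is the kind of inflation the abstract cautions against. If a search saves only one tree per replicate, the two rules give identical values, so the distinction matters only when multiple equally optimal trees are retained.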