Mar Jessica C, Harlow Timothy J, Ragan Mark A
Department of Mathematics, The University of Queensland, Brisbane, Qld 4072, Australia.
BMC Evol Biol. 2005 Jan 28;5:8. doi: 10.1186/1471-2148-5-8.
Bayesian phylogenetic inference holds promise as an alternative to maximum likelihood, particularly for large molecular-sequence data sets. We have investigated the performance of Bayesian inference with empirical and simulated protein-sequence data under conditions of relative branch-length differences and model violation.
With empirical protein-sequence data, Bayesian posterior probabilities provide more-generous estimates of subtree reliability than does the nonparametric bootstrap combined with maximum likelihood inference, reaching 100% posterior probability at bootstrap proportions around 80%. With simulated 7-taxon protein-sequence datasets, Bayesian posterior probabilities are somewhat more generous than bootstrap proportions, but do not saturate. Compared with likelihood, Bayesian phylogenetic inference can be as or more robust to relative branch-length differences for datasets of this size, particularly when among-sites rate variation is modeled using a gamma distribution. When the (known) correct model was used to infer trees, Bayesian inference recovered the (known) correct tree in 100% of instances in which one or two branches were up to 20-fold longer than the others. At ratios more extreme than 20-fold, topological accuracy of reconstruction degraded only slowly when only one branch was of relatively greater length, but more rapidly when there were two such branches. Under an incorrect model of sequence change, inaccurate trees were sometimes observed at less extreme branch-length ratios, and (particularly for trees with single long branches) such trees tended to be more inaccurate. The effect of model violation on accuracy of reconstruction for trees with two long branches was more variable, but gamma-corrected Bayesian inference nonetheless yielded more-accurate trees than did either maximum likelihood or uncorrected Bayesian inference across the range of conditions we examined. Assuming an exponential Bayesian prior on branch lengths did not improve, and under certain extreme conditions significantly diminished, performance. The two topology-comparison metrics we employed, edit distance and Robinson-Foulds symmetric distance, yielded different but highly complementary measures of performance.
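As an illustration of the second metric named above, the Robinson-Foulds symmetric distance counts the non-trivial bipartitions (splits) of the taxon set that occur in one unrooted tree but not the other. The following is a minimal sketch, not the authors' implementation: the nested-tuple tree representation, the function names, and the 5-taxon example trees are all assumptions made for illustration, and the edit-distance metric is not shown.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges of a tree.

    Each split is stored as the side that does NOT contain a fixed reference
    leaf, so the result does not depend on where the tuple nesting is rooted.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # arbitrary but fixed reference leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Skip trivial splits (empty side, single leaf, or all but one leaf).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Symmetric-difference distance: splits present in exactly one of the trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same taxon set")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical 5-taxon example trees (not taken from the paper).
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
rerooted_true_tree = ("A", "B", ("C", ("D", "E")))

print(robinson_foulds(true_tree, inferred_tree))
print(robinson_foulds(true_tree, rerooted_true_tree))
```

In this sketch the first comparison differs by one split in each tree, while the second compares two rootings of the same unrooted topology. Because Robinson-Foulds only records whether a split is shared, a single misplaced taxon can change many splits at once, which is one reason a second metric such as edit distance can be complementary.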
Our results demonstrate that Bayesian inference can be relatively robust against biologically reasonable levels of relative branch-length differences and model violation, and thus may provide a promising alternative to maximum likelihood for inference of phylogenetic trees from protein-sequence data.