Guimarães Fabreti Luiza, Höhna Sebastian
GeoBio-Center, Ludwig-Maximilians-Universität München, 80333 Munich, Germany.
Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, 80333 Munich, Germany.
Syst Biol. 2023 Dec 30;72(6):1418-1432. doi: 10.1093/sysbio/syad041.
Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but simple enough not to overfit. In phylogenetics, the most common model selection scenario concerns selecting an adequate substitution and partition model for sequence evolution to infer a phylogenetic tree. Several previous studies showed that substitution model under-parameterization can bias phylogenetic studies. Here, we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compared posterior estimates of phylogenetic tree topologies and tree length under the true model to those under the most complex model, the $\text{GTR}+\Gamma+\text{I}$ substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored 4 choices of prior distributions: the default substitution model priors of MrBayes, BEAST2, and RevBayes, and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization and over-partitioning, but only under our new prior settings. All 3 current default priors introduced biases in the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied, and that more effort should focus on more complex and biologically realistic substitution models.
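Fitting GTR to data simulated under Jukes-Cantor is over-parameterization rather than misspecification because Jukes-Cantor is nested within GTR: setting all six exchangeability rates equal and all four base frequencies to 1/4 recovers it exactly. The minimal Python sketch below illustrates this nesting. It is not the authors' code; the function names `gtr_rate_matrix` and `jc_rate_matrix` are hypothetical, and the sketch covers only the rate matrix, not the $\Gamma$ or I components.

```python
def gtr_rate_matrix(exchangeabilities, frequencies):
    """Build a normalized GTR rate matrix Q (states ordered A, C, G, T).

    exchangeabilities: six symmetric rates in the order AC, AG, AT, CG, CT, GT.
    frequencies: four stationary base frequencies that sum to 1.
    """
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    q = [[0.0] * 4 for _ in range(4)]
    for (i, j), rate in zip(pairs, exchangeabilities):
        # Off-diagonal entry: exchangeability times the frequency of the target base
        q[i][j] = rate * frequencies[j]
        q[j][i] = rate * frequencies[i]
    for i in range(4):
        # Each row of a rate matrix sums to zero
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    # Rescale so that one unit of time equals one expected substitution per site
    mean_rate = -sum(frequencies[i] * q[i][i] for i in range(4))
    return [[entry / mean_rate for entry in row] for row in q]


def jc_rate_matrix():
    """Jukes-Cantor rate matrix, normalized to one expected substitution per unit time."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


# Equal exchangeabilities and equal base frequencies collapse GTR to Jukes-Cantor
q_gtr = gtr_rate_matrix([1.0] * 6, [0.25] * 4)
q_jc = jc_rate_matrix()
same = all(abs(q_gtr[i][j] - q_jc[i][j]) < 1e-12 for i in range(4) for j in range(4))
print("GTR reduces to JC:", same)
```

The extra GTR parameters therefore have true values at which the simple model is recovered, so whether the posterior concentrates near those values, or is pulled away from them, depends on the priors placed on the additional parameters.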