School of Natural Sciences, University of Tasmania, Hobart, TAS, Australia.
Syst Biol. 2023 May 19;72(1):92-105. doi: 10.1093/sysbio/syac081.
In molecular phylogenetics, partition models and mixture models provide different approaches to accommodating heterogeneity in genomic sequencing data. Both types of models generally fit the data better than models that assume the process of sequence evolution is homogeneous across sites and lineages. The Akaike Information Criterion (AIC), an estimator of Kullback-Leibler divergence, and the Bayesian Information Criterion (BIC) are popular tools for model selection in phylogenetics. Recent work suggests that AIC should not be used for comparing mixture and partition models. In this work, we clarify that this difficulty is not fully explained by AIC misestimating the Kullback-Leibler divergence. We also investigate the performance of AIC and BIC in comparing among mixture models and among partition models. We find that under nonstandard conditions (i.e., when some edges have a small expected number of changes), AIC underestimates the expected Kullback-Leibler divergence. Under such conditions, AIC preferred the more complex mixture models and BIC preferred the simpler mixture models. The mixture models selected by AIC performed better in estimating the edge lengths, while the simpler models selected by BIC performed better in estimating the base frequencies and substitution rate parameters. In contrast, AIC and BIC both preferred simpler partition models over more complex partition models under nonstandard conditions, even though the more complex partition model was the generating model. We also investigated how mispartitioning (i.e., grouping sites that have not evolved under the same process) affects both the performance of partition models compared with mixture models and the model selection process. We found that as the level of mispartitioning increases, the bias of AIC in estimating the expected Kullback-Leibler divergence remains the same, and the branch lengths and evolutionary parameters estimated by partition models become less accurate. We recommend that researchers be cautious when using AIC and BIC to select among partition and mixture models; alternatives such as cross-validation and bootstrapping should be explored, but may suffer from similar limitations [AIC; BIC; mispartitioning; partitioning; partition model; mixture model].
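As a minimal illustration of how the two criteria can disagree, the sketch below computes AIC and BIC from their standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The log-likelihoods, parameter counts, and alignment length are hypothetical values chosen only for illustration and are not results from the study; taking the sample size n to be the number of alignment sites is a common convention and is assumed here.

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood: float, num_params: int, num_sites: int) -> float:
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return num_params * math.log(num_sites) - 2 * log_likelihood


# Hypothetical fitted models (illustrative numbers, not from the paper).
n_sites = 1000  # assumed sample size: number of alignment sites
models = {
    "simple": {"lnL": -10250.0, "k": 20},
    "complex": {"lnL": -10230.0, "k": 30},
}

# Score each model under both criteria; lower is better for both.
scores = {
    name: {
        "AIC": aic(m["lnL"], m["k"]),
        "BIC": bic(m["lnL"], m["k"], n_sites),
    }
    for name, m in models.items()
}

best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])

print("Preferred by AIC:", best_by_aic)
print("Preferred by BIC:", best_by_bic)
```

With these made-up numbers, the 20-unit gain in log-likelihood outweighs AIC's penalty of 2 per extra parameter but not BIC's penalty of ln(1000), roughly 6.9, per extra parameter, so the two criteria select different models.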