Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia.
Mol Biol Evol. 2016 Jan;33(1):255-67. doi: 10.1093/molbev/msv207. Epub 2015 Sep 28.
Determining the time scale of virus evolution is central to understanding the origins and emergence of viruses. The phylogenetic methods commonly used for this purpose can be misleading if the substitution model makes incorrect assumptions about the data. Empirical studies consider a pool of models and select the one with the highest statistical fit. However, this does not allow the rejection of all models, even if they poorly describe the data. An alternative is to use model adequacy methods that evaluate the ability of a model to predict hypothetical future observations. This can be done by comparing the empirical data with data generated under the model in question. We conducted simulations to evaluate the sensitivity of such methods with nucleotide, amino acid, and codon data. These methods effectively detected underparameterized models, but failed to detect mutational saturation and some instances of nonstationary base composition, which can lead to biases in estimates of tree topology and length. To test the applicability of these methods with real data, we analyzed nucleotide and amino acid data sets from the genus Flavivirus of RNA viruses. In most cases the substitution models were inadequate, with the exception of a data set of relatively closely related sequences of Dengue virus, for which the GTR+Γ nucleotide and LG+Γ amino acid substitution models were adequate. Our results partly explain the lack of consensus over estimates of the long-term evolutionary time scale of these viruses, and indicate that the adequacy of substitution models should be routinely assessed to determine whether estimates are reliable.
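The comparison of empirical data with data generated under a model can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the multinomial log-likelihood of site patterns as the test statistic, and it uses a hypothetical simulator (`simulate_jc69_star`) that evolves sequences under the Jukes-Cantor model along a star tree with a fixed branch length. A real analysis would instead simulate on trees and parameters sampled from the posterior of the model being assessed. All function names here are illustrative.

```python
import math
import random
from collections import Counter

BASES = "ACGT"

def multinomial_loglik(alignment):
    """Multinomial log-likelihood of the site patterns (columns) in an alignment.

    T = sum_i n_i * ln(n_i) - N * ln(N), where n_i is the count of the i-th
    unique site pattern and N is the number of sites.
    """
    n_sites = len(alignment[0])
    counts = Counter(zip(*alignment))  # one tuple per alignment column
    return sum(n * math.log(n) for n in counts.values()) - n_sites * math.log(n_sites)

def simulate_jc69_star(n_taxa, n_sites, branch_length, rng):
    """Hypothetical simulator: Jukes-Cantor evolution on a star tree.

    Every taxon descends independently from a shared root sequence along a
    branch of the given length (expected substitutions per site).
    """
    # Probability that a site is in the same state at both ends of a branch.
    p_same = 0.25 + 0.75 * math.exp(-4.0 * branch_length / 3.0)
    root = [rng.choice(BASES) for _ in range(n_sites)]
    sequences = []
    for _ in range(n_taxa):
        seq = []
        for base in root:
            if rng.random() < p_same:
                seq.append(base)
            else:
                seq.append(rng.choice([b for b in BASES if b != base]))
        sequences.append("".join(seq))
    return sequences

def predictive_p_value(empirical_stat, simulated_stats):
    """Fraction of simulated statistics less than or equal to the empirical one."""
    return sum(s <= empirical_stat for s in simulated_stats) / len(simulated_stats)

if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for an empirical alignment (assumption: toy data only).
    empirical = simulate_jc69_star(n_taxa=4, n_sites=200, branch_length=0.3, rng=rng)
    t_emp = multinomial_loglik(empirical)
    # Predictive distribution of the statistic under the candidate model.
    t_sim = [
        multinomial_loglik(simulate_jc69_star(4, 200, 0.3, rng))
        for _ in range(200)
    ]
    print("p-value:", predictive_p_value(t_emp, t_sim))
```

A p-value close to 0 or 1 would suggest that the model fails to reproduce this feature of the data. As the abstract notes, some violations, such as mutational saturation, may not be detected by this kind of test.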