Kosakovsky Pond Sergei L, Mannino Frank V, Gravenor Michael B, Muse Spencer V, Frost Simon D W
Department of Pathology, University of California, San Diego, USA.
Mol Biol Evol. 2007 Jan;24(1):159-70. doi: 10.1093/molbev/msl144. Epub 2006 Oct 12.
The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data with overly simplistic models may miss interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings. We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore the combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the overall rate patterns expected a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models. Using widely available small computer clusters, our approach makes it possible, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.
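To give a rough sense of why the candidate set is "combinatorially large", the sketch below counts models under one simplifying assumption that the abstract does not spell out: a candidate model is taken to be an assignment of each exchangeable substitution rate of a time-reversible model to one of k rate classes, so the count is a Stirling number of the second kind. The function names `stirling2` and `exchangeable_rates` are illustrative only, and the 16-state example assumes paired stem positions are modeled as dinucleotides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to partition n labelled items into exactly k non-empty classes."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing classes or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def exchangeable_rates(states: int) -> int:
    """Number of unordered state pairs, i.e. symmetric rates in a reversible model."""
    return states * (states - 1) // 2


if __name__ == "__main__":
    # Assumption: 4 states for single nucleotides, 16 for paired (dinucleotide) states.
    for states in (4, 16):
        n = exchangeable_rates(states)
        for k in (2, 3, 5):
            print(f"{states} states, {n} rates, {k} classes: {stirling2(n, k)} models")
```

Under these assumptions an exhaustive search is feasible for 4 states but not for 16, which is the kind of situation where a heuristic search such as a GA becomes attractive.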