Matsumoto Tomotaka, Akashi Hiroshi, Yang Ziheng
Division of Evolutionary Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan.
Division of Evolutionary Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan; Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Mishima, Shizuoka 411-8540, Japan.
Genetics. 2015 Jul;200(3):873-90. doi: 10.1534/genetics.115.177386. Epub 2015 May 6.
Inference of gene sequences in ancestral species has been widely used to test hypotheses concerning the process of molecular sequence evolution. However, the approach may produce spurious results, mainly because using the single best reconstruction while ignoring the suboptimal ones creates systematic biases. Here we implement methods to correct for such biases and use computer simulation to evaluate their performance when the substitution process is nonstationary. The methods we evaluated include parsimony and likelihood using the single best reconstruction (SBR), averaging over reconstructions weighted by the posterior probabilities (AWP), and a new method called expected Markov counting (EMC) that produces maximum-likelihood estimates of substitution counts for any branch under a nonstationary Markov model. We simulated base composition evolution on a phylogeny for six species, with different selective pressures on G+C content among lineages, and compared the counts of nucleotide substitutions recorded during simulation with the inference by different methods. We found that large systematic biases resulted from (i) the use of parsimony or likelihood with SBR, (ii) the use of a stationary model when the substitution process is nonstationary, and (iii) the use of the Hasegawa-Kishino-Yano (HKY) model, which is too simple to adequately describe the substitution process. The nonstationary general time reversible (GTR) model, used with AWP or EMC, accurately recovered the substitution counts, even in cases of complex parameter fluctuations. We discuss model complexity and the compromise between bias and variance and suggest that the new methods may be useful for studying complex patterns of nucleotide substitution in large genomic data sets.
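The contrast between SBR and AWP described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover EMC. It assumes that, for one branch, each site already has a joint posterior distribution over (ancestral base, descendant base) pairs. The function names (`sbr_counts`, `awp_counts`, `n_substitutions`) and the posterior values in `sites` are hypothetical, chosen only for illustration.

```python
BASES = "ACGT"


def empty_counts():
    """Return a zeroed table of (ancestral, descendant) base-pair counts."""
    return {(a, b): 0.0 for a in BASES for b in BASES}


def sbr_counts(site_posteriors):
    """Single best reconstruction: keep only the most probable pair per site."""
    counts = empty_counts()
    for post in site_posteriors:
        best_pair = max(post, key=post.get)
        counts[best_pair] += 1.0
    return counts


def awp_counts(site_posteriors):
    """Averaging weighted by posteriors: every pair contributes its probability."""
    counts = empty_counts()
    for post in site_posteriors:
        for pair, prob in post.items():
            counts[pair] += prob
    return counts


def n_substitutions(counts):
    """Sum the counts of pairs whose ancestral and descendant bases differ."""
    return sum(v for (a, b), v in counts.items() if a != b)


# Hypothetical posteriors for three sites on one branch (illustrative values,
# not data from the paper). Each dict maps (ancestral, descendant) to a probability.
sites = [
    {("A", "A"): 0.7, ("G", "A"): 0.3},
    {("A", "A"): 0.6, ("G", "A"): 0.4},
    {("G", "A"): 0.9, ("A", "A"): 0.1},
]

sbr_total = n_substitutions(sbr_counts(sites))
awp_total = n_substitutions(awp_counts(sites))
print(f"SBR substitutions: {sbr_total:.1f}")
print(f"AWP substitutions: {awp_total:.1f}")
```

With these made-up values, SBR discards the suboptimal G-to-A reconstructions at the first two sites and so counts fewer changes than AWP, which is the kind of systematic discrepancy the abstract attributes to ignoring suboptimal reconstructions.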