Whelan S, Goldman N
Department of Zoology, University of Cambridge, Cambridge, England.
Mol Biol Evol. 2001 May;18(5):691-9. doi: 10.1093/oxfordjournals.molbev.a003851.
Phylogenetic inference from amino acid sequence data uses mainly empirical models of amino acid replacement and is therefore dependent on those models. Two of the more widely used models, the Dayhoff and JTT models, are estimated using similar methods that can utilize large numbers of sequences from many unrelated protein families but are somewhat unsatisfactory because they rely on assumptions that may lead to systematic error and discard a large amount of the information within the sequences. The alternative method of maximum-likelihood estimation may utilize the information in the sequence data more efficiently and suffers from no systematic error, but it has previously been applicable to relatively few sequences related by a single phylogenetic tree. Here, we combine the best attributes of these two methods using an approximate maximum-likelihood method. We implemented this approach to estimate a new model of amino acid replacement from a database of globular protein sequences comprising 3,905 amino acid sequences split into 182 protein families. While the new model has an overall structure similar to those of other commonly used models, there are significant differences. The new model outperforms the Dayhoff and JTT models with respect to maximum-likelihood values for a large majority of the protein families in our database. This suggests that it provides a better overall fit to the evolutionary process in globular proteins and may lead to more accurate phylogenetic tree estimates. Potentially, this matrix, and the methods used to generate it, may also be useful in other areas of research, such as biological sequence database searching, sequence alignment, and protein structure prediction, for which an accurate description of amino acid replacement is required.
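As a rough illustration of what an empirical replacement model is and how it yields the likelihood values used to compare models, the sketch below builds a time-reversible rate matrix from symmetric exchangeabilities and equilibrium frequencies, converts it into transition probabilities P(t) = exp(Qt), and scores a pair of aligned sequences. It is not the approximate maximum-likelihood estimation procedure described in the abstract. The three-letter alphabet, all numeric values, and the function names are hypothetical stand-ins for a real 20-state amino acid matrix.

```python
import math

def build_rate_matrix(exch, freqs):
    """Reversible rate matrix with q_ij = s_ij * pi_j for i != j."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # each row sums to zero
    # Rescale so that one unit of time equals one expected replacement per site.
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t), via a truncated Taylor series plus repeated squaring."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[x * scale for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def pair_log_likelihood(seq_x, seq_y, t, q, freqs, alphabet):
    """Log-likelihood of two aligned, gap-free sequences separated by time t."""
    index = {c: i for i, c in enumerate(alphabet)}
    p = transition_probs(q, t)
    return sum(math.log(freqs[index[x]] * p[index[x]][index[y]])
               for x, y in zip(seq_x, seq_y))

# Hypothetical toy model: 3 states instead of 20 amino acids.
ALPHABET = "ABC"
FREQS = [0.5, 0.3, 0.2]            # equilibrium frequencies (made up)
EXCH = [[0.0, 2.0, 1.0],           # symmetric exchangeabilities (made up)
        [2.0, 0.0, 0.5],
        [1.0, 0.5, 0.0]]

Q = build_rate_matrix(EXCH, FREQS)
P = transition_probs(Q, 0.1)
print(pair_log_likelihood("AABC", "ABBC", 0.1, Q, FREQS, ALPHABET))
```

Comparing two candidate models then amounts to computing such log-likelihoods under each model's matrix for the same data and preferring the higher value. A real analysis would do this over a full phylogenetic tree rather than a single sequence pair.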