An improved general amino acid replacement matrix.

Author information

Le Si Quang, Gascuel Olivier

Affiliation

Méthodes et Algorithmes pour la Bioinformatique, LIRMM, CNRS, Université Montpellier II, Montpellier, France.

Publication information

Mol Biol Evol. 2008 Jul;25(7):1307-20. doi: 10.1093/molbev/msn067. Epub 2008 Mar 26.

DOI:10.1093/molbev/msn067

PMID:18367465

Abstract

Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) with their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and by using a much larger and more diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the performance of LG, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide the Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42 when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59) and significantly worse for only 2 alignments; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
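To make concrete how a replacement matrix yields substitution probabilities along a branch, and how an AIC gain per site can be computed, here is a minimal Python sketch. It is not the authors' code and does not use the LG values: the 4-state exchangeabilities and frequencies are hypothetical toy numbers (a real amino acid matrix has 20 states), the function names are invented for illustration, and the matrix exponential is approximated with a simple scaling-and-squaring Taylor series rather than the eigendecomposition that phylogenetic software typically uses. The AIC comparison assumes the standard definition AIC = 2k - 2 lnL, which may differ in detail from the paper's exact procedure.

```python
import math


def build_rate_matrix(exch, freqs):
    """Reversible rate matrix: Q[i][j] = exch[i][j] * freqs[j] for i != j.

    Rows sum to zero and Q is rescaled so that the expected number of
    substitutions per unit of branch length is 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this sums off-diagonals
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probs(q, t, terms=20):
    """Approximate P(t) = exp(Q * t) by scaling and squaring."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    # Halve the argument s times so the Taylor series converges quickly.
    s = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x * t / 2 ** s for x in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(s):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A/2)^2
    return result


def aic_gain_per_site(lnl_new, lnl_ref, k_new, k_ref, n_sites):
    """Per-site AIC difference (reference minus new); positive favors the new model."""
    aic_new = 2 * k_new - 2 * lnl_new
    aic_ref = 2 * k_ref - 2 * lnl_ref
    return (aic_ref - aic_new) / n_sites


# Hypothetical toy model with 4 states (NOT the LG matrix).
EXCH = [[0, 1, 2, 1],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [1, 2, 1, 0]]
FREQS = [0.1, 0.2, 0.3, 0.4]

if __name__ == "__main__":
    q = build_rate_matrix(EXCH, FREQS)
    for row in transition_probs(q, 0.5):
        print(["%.4f" % x for x in row])
```

Each row of the resulting matrix gives the probabilities of ending in each state after a branch of length t, given the starting state. When two fixed matrices are compared on the same alignment with the same number of free parameters, the AIC gain per site reduces to twice the log-likelihood difference divided by the number of sites.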
