Feng D F, Johnson M S, Doolittle R F
J Mol Evol. 1984;21(2):112-25. doi: 10.1007/BF02100085.
We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of "weighting" in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), in which only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on the structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical "jumbling test." This was not universally the case, however, and in some instances the simple UM was actually as good as or better than the LOM. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the differing effectiveness of the four approaches in the two sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.
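The abstract names the ingredients (a Needleman-Wunsch similarity alignment, a unitary matrix that scores only identities, and a "jumbling test") but gives no implementation details. The following is only a minimal sketch of how such a comparison might be set up. It is not the authors' program. The linear gap penalty of -2, the number of shuffles, the expression of the result as standard deviations above the mean of shuffled scores, and the names `nw_score`, `unitary`, and `jumbling_z` are all assumptions made for illustration.

```python
import math
import random
import statistics


def unitary(x, y):
    """Unitary-matrix (UM) scoring: 1 for an identity, 0 otherwise."""
    return 1 if x == y else 0


def nw_score(a, b, score, gap=-2):
    """Global similarity score by Needleman-Wunsch dynamic programming.

    `gap` is an assumed linear per-residue gap penalty; the abstract
    does not state how gaps were penalized.
    """
    # prev[j] holds the best score for a[:i-1] aligned against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


def jumbling_z(a, b, score, trials=100, gap=-2, seed=0):
    """Assumed form of a jumbling test.

    Shuffles `b` repeatedly (keeping its composition), realigns it with
    `a`, and reports how many standard deviations the real score lies
    above the mean of the shuffled scores.
    """
    rng = random.Random(seed)
    real = nw_score(a, b, score, gap)
    shuffled_scores = []
    for _ in range(trials):
        jumbled = list(b)
        rng.shuffle(jumbled)
        shuffled_scores.append(nw_score(a, "".join(jumbled), score, gap))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: every shuffle scored the same.
        return 0.0 if real == mean else math.copysign(math.inf, real - mean)
    return (real - mean) / sd


if __name__ == "__main__":
    # Arbitrary toy strings, not sequences from the study.
    s1, s2 = "HEAGAWGHEE", "PAWHEAE"
    print(nw_score(s1, s2, unitary))
    print(jumbling_z(s1, s2, unitary))
```

Under these assumptions, swapping `unitary` for a function that looks up a weighted matrix (for example a log-odds table) would be the only change needed to compare schemes on the same pair of sequences.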