Li Jing, Wang Wei
National Laboratory of Solid State Microstructure and Department of Physics, Nanjing University, Nanjing, 210093, China.
Sci China C Life Sci. 2007 Jun;50(3):392-402. doi: 10.1007/s11427-007-0023-3.
Sequence alignment is a common method for finding structurally conserved or similar regions in proteins. However, alignments are often inaccurate when the sequence identity between the sequences to be aligned is below 30%. For such sequences, different residues may play similar structural roles, and these residues are aligned incorrectly when the alignment uses a substitution matrix defined over all 20 residue types. Residues can be clustered into a few groups based on the similarity of their physicochemical features. Such simplified alphabets reduce the complexity of protein sequences while retaining the key information encoded in them. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, using a database of aligned protein structures (DAPS), a new clustering method based on substitution scores is proposed for grouping residues, and residue substitution matrices are constructed at different levels of simplification. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are then applied to recognizing structurally conserved or similar regions in proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with an optimal reduced alphabet of about N = 9 residue groups.
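The effect of a reduced alphabet on a low-identity pair can be illustrated with a minimal Python sketch. The nine-group partition below is a hypothetical one based on rough physicochemical classes, chosen only for illustration; it is not the grouping derived from DAPS in the paper, and the names `GROUPS`, `reduce_sequence`, and `identity` are assumptions of this sketch.

```python
# Hypothetical partition of the 20 residues into N = 9 groups by rough
# physicochemical similarity. NOT the grouping derived in the paper.
GROUPS = ["C", "G", "P", "AST", "ILMV", "FWY", "DE", "NQ", "HKR"]

# Each residue is represented by the first letter of its group.
RESIDUE_TO_GROUP = {res: group[0] for group in GROUPS for res in group}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet.

    Characters outside the 20 standard residues (for example the gap
    symbol '-') are passed through unchanged.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, res) for res in seq.upper())


def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Toy pre-aligned fragments that share no identical residue.
    seq_a = "ILDKFS"
    seq_b = "VMERYT"
    print(identity(seq_a, seq_b))
    print(identity(reduce_sequence(seq_a), reduce_sequence(seq_b)))
```

In this toy case the two fragments have no identical positions in the 20-letter alphabet, yet every position matches once both are rewritten in the reduced alphabet, because each aligned pair of residues falls in the same group. A real application would pair such a mapping with a substitution matrix built for the reduced alphabet, as the abstract describes.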