Peterson Eric L, Kondev Jané, Theriot Julie A, Phillips Rob
Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA.
Bioinformatics. 2009 Jun 1;25(11):1356-62. doi: 10.1093/bioinformatics/btp164. Epub 2009 Apr 7.
Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet.
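To make the idea of a reduced alphabet concrete, the following minimal Python sketch rewrites sequences so that every residue is replaced by a representative of its cluster. The grouping `EXAMPLE_GROUPS` and the helper names are assumptions made for illustration; they are not one of the clustering schemes evaluated in this study.

```python
# Hypothetical grouping for illustration only; NOT a scheme taken from the study.
# Each string is one cluster; the first letter serves as the cluster's symbol.
EXAMPLE_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]


def build_mapping(groups):
    """Map every residue to the first letter of the cluster containing it."""
    mapping = {}
    for group in groups:
        representative = group[0]
        for residue in group:
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence, mapping, unknown="X"):
    """Rewrite a sequence in the reduced alphabet; unmapped symbols become `unknown`."""
    return "".join(mapping.get(residue, unknown) for residue in sequence.upper())


def percent_identity(a, b):
    """Percent of identical positions between two equal-length, ungapped strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


mapping = build_mapping(EXAMPLE_GROUPS)
seq_a, seq_b = "MKVLD", "IRALE"  # toy sequences, not real proteins
reduced_a = reduce_sequence(seq_a, mapping)
reduced_b = reduce_sequence(seq_b, mapping)
print(percent_identity(seq_a, seq_b), percent_identity(reduced_a, reduced_b))
```

In this toy case two sequences that agree at only one of five positions in the full alphabet become identical once similar residues are merged, which is the effect a reduced alphabet is meant to exploit.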
We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance mAtrix aLIgnment (DALI) database. We combined several information retrieval metrics that are popular in the literature (mean precision, area under the Receiver Operating Characteristic curve, and recall at a fixed error rate) and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches.
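The three retrieval metrics named above can be sketched in Python for a list of alignment scores, each labeled by whether the pair is structurally related. This sketch rests on assumptions: the function names are hypothetical, "mean precision" is computed as the average of the precision at each true hit, the "error rate" is taken to be the false positive rate, and ties between scores are not handled specially. The study's exact definitions may differ.

```python
def rank_labels(scores, labels):
    """Return the labels ordered from the highest score to the lowest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def count_classes(ranked):
    """Count positives and negatives; both must be present for the rate metrics."""
    positives = sum(1 for is_pos in ranked if is_pos)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    return positives, negatives


def average_precision(ranked):
    """Mean of the precision values measured at the rank of each positive."""
    hits, total = 0, 0.0
    for rank, is_pos in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def roc_auc(ranked):
    """Fraction of (positive, negative) pairs in which the positive ranks higher."""
    positives, negatives = count_classes(ranked)
    negatives_seen, correctly_ordered = 0, 0
    for is_pos in ranked:
        if is_pos:
            correctly_ordered += negatives - negatives_seen
        else:
            negatives_seen += 1
    return correctly_ordered / (positives * negatives)


def recall_at_error_rate(ranked, max_fpr):
    """Recall at the deepest cutoff whose false positive rate is at most max_fpr."""
    positives, negatives = count_classes(ranked)
    true_pos, false_pos, best = 0, 0, 0.0
    for is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / negatives > max_fpr:
            break
        best = true_pos / positives
    return best


# Toy data: six scored pairs, three of which are truly related.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, False, True, True, False, False]
ranked = rank_labels(scores, labels)
print(average_precision(ranked), roc_auc(ranked), recall_at_error_rate(ranked, 0.0))
```

Computing the same three numbers for the full alphabet and for each reduced alphabet on a common set of labeled pairs gives one way to compare schemes side by side.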
A table of results, as well as the substitution matrices and residue groupings from this study, can be downloaded from http://www.rpgroup.caltech.edu/publications/supplements/alphabets.