Brenner S E, Chothia C, Hubbard T J
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom.
Proc Natl Acad Sci U S A. 1998 May 26;95(11):6073-8. doi: 10.1073/pnas.95.11.6073.
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA (ktup = 1), and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those that are identified may be used with confidence.
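The comparison of reported E-values with observed false positives can be illustrated with a small sketch. It assumes each database hit carries an E-value and a ground-truth homology label (for example, one derived from a structural classification such as SCOP). The `Hit` record, the `evaluate` function, and the toy data are hypothetical and are not taken from the paper. Coverage is computed here as the fraction of known relationships recovered, and the error rate as false positives per query; if E-values are reliable, the errors per query at a threshold E should be roughly E on a realistically sized test set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise match reported by a search program (hypothetical record)."""
    query: str
    target: str
    evalue: float
    homologous: bool  # ground truth, assumed known from structure and function


def evaluate(hits, n_queries, n_true_pairs, threshold):
    """Return (coverage, errors_per_query) for hits with evalue <= threshold."""
    accepted = [h for h in hits if h.evalue <= threshold]
    true_pos = sum(h.homologous for h in accepted)
    false_pos = len(accepted) - true_pos
    coverage = true_pos / n_true_pairs      # fraction of real relationships found
    errors_per_query = false_pos / n_queries  # observed false positives per query
    return coverage, errors_per_query


# Toy data, invented purely for illustration.
hits = [
    Hit("q1", "t1", 1e-30, True),
    Hit("q1", "t2", 1e-4, True),
    Hit("q2", "t3", 0.5, False),
    Hit("q2", "t4", 2.0, True),
    Hit("q3", "t5", 5.0, False),
]

coverage, errors_per_query = evaluate(
    hits, n_queries=4, n_true_pairs=4, threshold=1.0
)
print(f"coverage={coverage:.2f}, errors per query={errors_per_query:.2f}")
```

Sweeping `threshold` over a range of values and plotting coverage against errors per query gives one curve per program and scoring scheme, which is one way to compare statistical scores with percentage identity or raw scores on the same footing.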