Yu Yi-Kuo, Gertz E Michael, Agarwala Richa, Schäffer Alejandro A, Altschul Stephen F
National Center for Biotechnology Information, National Library of Medicine, NIH, DHHS, Bethesda, MD 20894, USA.
Nucleic Acids Res. 2006;34(20):5966-73. doi: 10.1093/nar/gkl731. Epub 2006 Oct 26.
Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
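The abstract does not say how the two measures are combined. As an illustration only, the sketch below assumes each measure is expressed as a P-value and that the two are independent, then applies the standard result for the product of two independent uniform variables (equivalent to Fisher's method for two tests). The function name and parameter names are hypothetical and do not come from BLAST.

```python
import math


def combine_independent_pvalues(p_alignment: float, p_composition: float) -> float:
    """Combine two independent P-values into one (illustrative sketch).

    Assumption: under the null hypothesis both inputs are independent and
    uniform on (0, 1]. For their product t, P(U1 * U2 <= t) = t * (1 - ln t).
    """
    for p in (p_alignment, p_composition):
        if not 0.0 < p <= 1.0:
            raise ValueError("P-values must lie in (0, 1]")
    t = p_alignment * p_composition
    return t * (1.0 - math.log(t))


if __name__ == "__main__":
    # Hypothetical inputs: a strong alignment P-value and a moderate
    # compositional P-value.
    print(combine_independent_pvalues(1e-6, 1e-2))
```

Under these assumptions the combined value is never smaller than the raw product, and pairing a strong alignment P-value with an uninformative compositional one (a P-value of 1) slightly weakens the result instead of leaving it unchanged. That is the price of testing two sources of evidence at once.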