Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF
National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005. doi: 10.1093/nar/29.14.2994.
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
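The abstract describes the ROC measure only as normalizable to the range 0 (worst) to 1 (best). One common form in sequence-search evaluation is the truncated ROC score: rank the hits best-first, and for each of the first n false positives count the true positives ranked ahead of it, then divide the sum by n times the total number of true positives. The sketch below illustrates that form. The function name `roc_n`, the cutoff `n`, and the handling of lists that end before n false positives are assumptions made for illustration, and it may not match the exact variant or cutoff used in the study.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated ROC score, normalized to the range [0, 1].

    ranked_labels: hits ordered best-first; True marks a true positive.
    total_true:    number of true positives that exist for the query
                   (assumed known from a curated list).
    n:             number of false positives after which ranking stops
                   (hypothetical cutoff, chosen by the caller).
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum of true_seen taken at each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives appear,
    # the missing false positives are treated as ranked after every
    # retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Toy ranking: two true hits, one false, one true, one false.
    ranking = [True, True, False, True, False]
    print(roc_n(ranking, total_true=4, n=2))  # 0.625
```

In the toy ranking, two true positives precede the first false positive and three precede the second, so the score is (2 + 3) / (2 × 4) = 0.625. A ranking that places every true positive ahead of the first false positive scores 1, and one that retrieves no true positives before the cutoff scores 0.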