Shpaer E G, Robinson M, Yee D, Candlin J D, Mines R, Hunkapiller T
Perkin-Elmer, Applied Biosystems Division, Foster City, California 94404, USA.
Genomics. 1996 Dec 1;38(2):179-91. doi: 10.1006/geno.1996.0614.
To predict the function of the possible protein product of a new or uncharacterized DNA sequence, it is important first to detect all significant similarities between the encoded amino acid sequence and the accumulated protein sequence data. We assembled a set of query and database sequences and used it to test and compare various similarity search methods and their parameterizations. We demonstrate here that the Smith-Waterman (S-W) dynamic programming method and the optimized version of FASTA are significantly better able to distinguish true similarities from statistical noise than is the popular database search tool BLAST. In addition, a simple "log-length normalization" of S-W scores, based on the lengths of the query and target sequences, greatly increased the selectivity of the S-W searches, to a level exceeding that of the default normalization method of FASTA. An implementation of the modified S-W algorithm in hardware (the Fast Data Finder) matches the accuracy of the software versions while greatly speeding up execution. We present the selectivity and sensitivity data from these tests, results for various scoring matrices, and data that will help users choose threshold score values for evaluating database search results. We also illustrate the impact of using simple-sequence masking tools such as SEG or XNU.
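The idea of length-normalizing a Smith-Waterman score can be illustrated with a minimal sketch. The abstract does not state the normalization formula, the scoring matrix, or the gap model, so everything specific below is an assumption made for illustration: the function names `smith_waterman_score` and `log_length_normalized`, the match/mismatch scoring with a linear gap penalty (real protein searches would use a substitution matrix such as PAM or BLOSUM), and the choice to divide the raw score by the natural logarithm of the product of the two sequence lengths.

```python
import math


def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Illustrative Smith-Waterman recurrence with match/mismatch scoring
    and a linear gap penalty (assumed parameters, not the paper's).
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap         # gap in the target
            left = curr[j - 1] + gap   # gap in the query
            cell = max(0, diag, up, left)  # floor at 0: local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def log_length_normalized(score, query_len, target_len):
    """Hypothetical log-length normalization: score / ln(m * n).

    The exact formula used in the paper is not given in the abstract;
    this form is an assumption chosen only to show the principle.
    """
    product = query_len * target_len
    if product <= 1:
        raise ValueError("sequence lengths must give a product above 1")
    return score / math.log(product)


if __name__ == "__main__":
    query = "ACGT"
    for target in ("ACGT", "TTACGTTT"):
        raw = smith_waterman_score(query, target)
        norm = log_length_normalized(raw, len(query), len(target))
        print(target, raw, round(norm, 3))
```

Under these assumptions both targets receive the same raw score, because each contains an exact copy of the query, but the longer target receives the lower normalized score. That is the intended effect: a longer database sequence has more opportunities to reach a given score by chance, so the same raw score counts as weaker evidence of a true similarity.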