Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA.

Author information

Shpaer E G, Robinson M, Yee D, Candlin J D, Mines R, Hunkapiller T

Affiliation

Perkin-Elmer, Applied Biosystems Division, Foster City, California 94404, USA.

Publication information

Genomics. 1996 Dec 1;38(2):179-91. doi: 10.1006/geno.1996.0614.

DOI:10.1006/geno.1996.0614

PMID:8954800

Abstract

To predict the functions of a possible protein product of any new or uncharacterized DNA sequence, it is important first to detect all significant similarities between the encoded amino acid sequence and any accumulated protein sequence data. We have implemented a set of queries and database sequences and proceeded to test and compare various similarity search methods and their parameterizations. We demonstrate here that the Smith-Waterman (S-W) dynamic programming method and the optimized version of FASTA are significantly better able to distinguish true similarities from statistical noise than is the popular database search tool BLAST. Also, a simple "log-length normalization" of S-W scores based on the query and target sequence lengths greatly increased the selectivity of the S-W searches, exceeding the default normalization method of FASTA. An implementation of the modified S-W algorithm in hardware (the Fast Data Finder) is able to match the accuracy of software versions while greatly speeding up its execution. We present here the selectivity and sensitivity data from these tests as well as results for various scoring matrices. We present data that will help users to choose threshold score values for evaluation of database search results. We also illustrate the impact of using simple-sequence masking tools such as SEG or XNU.
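
As a rough illustration of the two ideas named in the abstract, the sketch below computes a Smith-Waterman local alignment score and then applies a length-based normalization. It is not the authors' implementation. The match/mismatch scores and the linear gap penalty stand in for the scoring matrices and gap parameters that the paper evaluates, and the normalization shown (raw score divided by the natural log of the product of the two sequence lengths) is only an assumed form, since the abstract does not state the exact formula. The function names `smith_waterman_score` and `log_length_normalized` are hypothetical.

```python
import math


def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty).

    The simple match/mismatch scheme is a placeholder for a real
    substitution matrix such as PAM or BLOSUM.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # floor at 0: alignment may restart
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def log_length_normalized(score, query_len, target_len):
    """Assumed normalization: score / ln(query_len * target_len).

    The exact formula used in the paper is not given in the abstract;
    this form is an illustrative guess only.
    """
    product = query_len * target_len
    if product < 2:
        raise ValueError("sequence lengths too short to normalize")
    return score / math.log(product)


if __name__ == "__main__":
    query = "HEAGAWGHEE"
    target = "PAWHEAE"
    raw = smith_waterman_score(query, target)
    print(raw, log_length_normalized(raw, len(query), len(target)))
```

Under this assumed form, the same raw score counts for less against a longer target, which reflects the intuition that long sequences are more likely to produce high scores by chance.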
