Ladunga I, Wiese B A, Smith R F
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
J Mol Biol. 1996 Jun 21;259(4):840-54. doi: 10.1006/jmbi.1996.0362.
We introduce two new pattern database search tools that utilize statistical significance and information theory to improve protein function identification. Both the general pattern scoring theory with the specific matrices introduced here and the low redundancy of pattern databases increase search sensitivity and selectivity. Pattern scoring preferentially rewards matches at conserved positions in a pattern with higher scores than matches at variable positions, and assigns more negative scores to mismatches at conserved positions than to mismatches at variable positions. The theory of pattern scoring can be used to create log-odds pattern scores for patterns derived from any set of multiple alignments. This theoretical framework can be used to adapt existing sequence database search tools to pattern analysis. Our FASTA-SWAP and FASTA-PAT tools are extensions of the FASTA program that search a sequence query against a pattern database. In the first step, FASTA-SWAP searches the diagonals of the query sequence and the library pattern for high-scoring segments, while FASTA-PAT performs an extended version of hashing. In the second step, both methods refine the alignments and the scores using dynamic programming. The tools utilize an extremely compact binary representation of all possible combinations of amino acid residues in aligned positions. Our FASTA-SWAP and FASTA-PAT tools are well suited for functional identification of distant relatives that may be missed by sequence database search methods. FASTA-SWAP and FASTA-PAT searches can be performed using our World-Wide Web Server (http://dot.imgen.bcm.tmc.edu:9331/seq-search/Options/fastapat.html).
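The scoring idea and the binary representation described above can be illustrated with a small sketch. This is not the FASTA-SWAP or FASTA-PAT code, and it does not use the matrices from the paper. It assumes a uniform background frequency, a pseudocount weight equal to the number of distinct residues in a column, and a pattern given as a list of alignment-column strings. All function names (`column_mask`, `column_scores`, `best_ungapped_hit`) are hypothetical. The scan at the end only imitates an ungapped first pass; the dynamic programming refinement step is omitted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One bit per residue, so any combination of residues fits in a 20-bit integer.
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}
# Assumption: uniform background frequencies (real tools use observed ones).
BACKGROUND = 1.0 / len(AMINO_ACIDS)


def column_mask(column):
    """Encode the set of residues seen in one alignment column as a bitmask."""
    mask = 0
    for aa in column:
        mask |= BIT.get(aa, 0)  # gaps and unknown symbols contribute no bit
    return mask


def column_scores(column):
    """Return a log-odds score (in bits) for every residue at this column.

    Assumption: the pseudocount weight equals the number of distinct
    residues observed, so variable columns are smoothed more strongly
    than conserved ones.
    """
    residues = [aa for aa in column if aa in BIT]
    if not residues:
        raise ValueError("column contains no standard residues")
    pseudo = float(bin(column_mask(column)).count("1"))
    total = len(residues) + pseudo
    scores = {}
    for aa in AMINO_ACIDS:
        freq = (residues.count(aa) + pseudo * BACKGROUND) / total
        scores[aa] = math.log2(freq / BACKGROUND)
    return scores


def best_ungapped_hit(query, pattern):
    """Slide the pattern along the query without gaps.

    Returns (best_score, offset); (-inf, -1) if the query is too short.
    Residues outside the 20-letter alphabet score 0.
    """
    tables = [column_scores(col) for col in pattern]
    best = (-math.inf, -1)
    for offset in range(len(query) - len(tables) + 1):
        window = query[offset:offset + len(tables)]
        score = sum(t.get(aa, 0.0) for t, aa in zip(tables, window))
        best = max(best, (score, offset))
    return best


conserved = column_scores("LLLLLLLL")   # a fully conserved column
variable = column_scores("LIVMLIVM")    # a variable column
pattern = ["GGGG", "LIVM", "KKKK"]      # toy three-column pattern
hit_score, hit_offset = best_ungapped_hit("AAGLKAA", pattern)
```

Under these assumptions, a match to `L` scores higher in the conserved column than in the variable one, and a mismatch such as `W` scores lower in the conserved column than in the variable one, which is the behaviour the abstract attributes to pattern scoring. Testing whether a residue was observed in a column reduces to a single bitwise AND against the column mask.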