Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C
MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
J Mol Biol. 1998 Dec 11;284(4):1201-10. doi: 10.1006/jmbi.1998.2221.
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
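The figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and it assumes the "about twice" comparison can be read as each method's percentage divided by FASTA's, which the abstract does not state.

```python
from fractions import Fraction

# Figures quoted in the abstract.
FALSE_POSITIVES = 9
POSSIBLE_FALSE_POSITIVES = 432_680

# Percentage of true homologous relationships in PDBD40-J found by each method.
COVERAGE_PERCENT = {
    "SAM-T98": 35,
    "PSI-BLAST": 30,
    "ISS": 25,
    "FASTA": 17,
    "GAP-BLAST": 15,
}


def one_in(false_positives: int, possible: int) -> int:
    """Express a false positive rate as '1 in N', rounded to the nearest integer."""
    rate = Fraction(false_positives, possible)
    return round(1 / rate)


def ratio_to_baseline(method: str, baseline: str = "FASTA") -> float:
    """Ratio of one method's coverage to a pairwise baseline (an assumed reading)."""
    return COVERAGE_PERCENT[method] / COVERAGE_PERCENT[baseline]


if __name__ == "__main__":
    print(f"False positive rate: about 1 in {one_in(FALSE_POSITIVES, POSSIBLE_FALSE_POSITIVES):,}")
    for method in ("SAM-T98", "PSI-BLAST", "ISS"):
        print(f"{method}: {ratio_to_baseline(method):.2f}x FASTA coverage")
```

Nine in 432,680 works out to roughly 1 in 48,000, which the abstract rounds to 1/50,000.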