Gough J, Karplus K, Hughey R, Chothia C
Laboratory of Molecular Biology, MRC, Hills Road, Cambridge, CB2 2QH, UK.
J Mol Biol. 2001 Nov 2;313(4):903-19. doi: 10.1006/jmbi.2001.5080.
Of the sequence comparison methods, profile-based methods are more selective than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models, each built from a single one of a set of diverse seed sequences, than using one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting errors that arise at the model-building stage. These two improvements greatly increase selectivity and coverage. The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represents essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities of less than 95% are used as seeds to build the models. Using the current data, this gives a library of 4894 models. The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in whole or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average, roughly 15% of genome sequences are labelled as hypothetical yet are homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.
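The seed-selection step (domain sequences with identities below 95% used as seeds) can be illustrated with a minimal sketch. The abstract does not say how identity is computed or how redundant sequences are removed, so the greedy filter, the gap handling, and the names `percent_identity` and `select_seeds` below are assumptions made for illustration, not the authors' procedure. The input is assumed to be already aligned, equal-length sequences.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def select_seeds(aligned: dict[str, str], max_identity: float = 95.0) -> list[str]:
    """Greedily keep sequences that are below max_identity to every kept seed.

    Hypothetical helper: the order of the input decides which member of a
    redundant group is retained.
    """
    seeds: list[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[s]) < max_identity for s in seeds):
            seeds.append(name)
    return seeds


# Toy domain sequences (invented for illustration only).
domains = {
    "d1": "ACDEFGHIKLMNPQRSTVWY",
    "d2": "ACDEFGHIKLMNPQRSTVWY",  # identical to d1
    "d3": "ACDEFGHIKLMNPQRSTVWA",  # 95% identical to d1
    "d4": "ACDEFGHIKLMNPQRSTVAA",  # 90% identical to d1
}
seeds = select_seeds(domains)
```

In this toy run only `d1` and `d4` are retained. In the approach the abstract describes, each retained seed would then be used to build its own model; that step is not shown here.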