Sillitoe Ian, Dibley Mark, Bray James, Addou Sarah, Orengo Christine
Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, UK.
Protein Sci. 2005 Jul;14(7):1800-10. doi: 10.1110/ps.041056105. Epub 2005 Jun 3.
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL, increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
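The recognition figures quoted above (76% and 86%) are coverage values: the fraction of a validated benchmark of remote homologs that an HMM library assigns to the correct superfamily. The following is a minimal sketch of how such a figure could be computed, not the authors' protocol. The function name `coverage`, the tuple layout of the hits, the E-value cutoff of 1e-3, and the toy identifiers are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical hit record: (domain id, superfamily of the matching model, E-value).
Hit = Tuple[str, str, float]


def coverage(benchmark: Dict[str, str],
             hits: Iterable[Hit],
             max_evalue: float = 1e-3) -> float:
    """Fraction of benchmark domains recognized by at least one
    significant hit to a model of their validated superfamily.

    `benchmark` maps each domain id to its validated superfamily.
    The E-value cutoff is an assumed value, not one from the paper.
    """
    if not benchmark:
        return 0.0
    recognized = set()
    for domain, superfamily, evalue in hits:
        # Count a domain only if the hit is significant AND points to
        # the superfamily recorded for that domain in the benchmark.
        if evalue <= max_evalue and benchmark.get(domain) == superfamily:
            recognized.add(domain)
    return len(recognized) / len(benchmark)


# Toy data (hypothetical identifiers, not real CATH entries).
toy_benchmark = {"dom1": "sf_A", "dom2": "sf_A", "dom3": "sf_B", "dom4": "sf_C"}
toy_hits = [
    ("dom1", "sf_A", 1e-10),  # correct and significant
    ("dom2", "sf_A", 1e-5),   # correct and significant
    ("dom3", "sf_C", 1e-8),   # significant, but wrong superfamily
    ("dom4", "sf_C", 0.5),    # correct superfamily, not significant
]
toy_coverage = coverage(toy_benchmark, toy_hits)
```

In this toy set two of the four domains are recognized, so `toy_coverage` is 0.5. Comparing two libraries on the same benchmark, as the abstract does for the single-seed library and CATH-ISL, would amount to calling the function once per library's hit list.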