Gough Julian
Unité de Bioinformatique Structurale, Institut Pasteur, 25-28 Rue du Docteur Roux, 75724 Paris Cedex 15, Paris, France.
Nucleic Acids Res. 2006 Jul 28;34(13):3625-33. doi: 10.1093/nar/gkl484. Print 2006.
Many classification schemes for proteins and domains are hierarchical or semi-hierarchical, yet most databases, especially those offering genome-wide analysis, provide assignments to sequences at only one level of their hierarchy. Given an established hierarchy, assigning new sequences to lower levels of that hierarchy is easier (but no less important) than the initial top-level assignment, which requires the detection of the most distant relationships. A solution to this problem is described here in the form of a new procedure that can be thought of as a hybrid between pairwise and profile methods. The hybrid method is a general procedure that can be applied to any pre-defined hierarchy, at any level, including in principle multiple sub-levels. It has been tested on the SCOP classification via the SUPERFAMILY database and performs significantly better than either pairwise or profile methods alone. Perhaps the greatest advantage of the hybrid method over other possible approaches is that, within the framework of an existing profile library, the assignments are fully automatic and come at almost no additional computational cost. Hence it has already been applied at the SCOP family level to all genomes in the SUPERFAMILY database, providing a wealth of new data to the biological and bioinformatics communities.
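The abstract does not spell out the procedure, so the following is only a minimal sketch of one possible reading of "a hybrid between pairwise and profile methods". It assumes that a query and the known members of a superfamily have each already been aligned to the same profile, that a pairwise alignment can be induced from the profile columns two sequences share, and that the query takes the family label of its best-scoring member. The column-map representation, the function names (`to_columns`, `induced_identity`, `assign_family`), and the use of fractional identity as the score are all illustrative assumptions, not the published method.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a sequence aligned to a profile is stored as a
# mapping from profile column index to the residue placed in that column.
ColumnMap = Dict[int, str]


def to_columns(aligned: str) -> ColumnMap:
    """Convert a gapped string (one character per profile column, '-' = gap)."""
    return {i: ch for i, ch in enumerate(aligned) if ch != "-"}


def induced_identity(query: ColumnMap, member: ColumnMap) -> float:
    """Score the pairwise alignment induced by the shared profile columns.

    Fractional identity is a stand-in (an assumption); a real scorer would
    more likely use a substitution matrix and gap handling.
    """
    shared = query.keys() & member.keys()
    if not shared:
        return 0.0
    matches = sum(1 for col in shared if query[col] == member[col])
    return matches / len(shared)


def assign_family(
    query: ColumnMap, members: List[Tuple[str, ColumnMap]]
) -> Tuple[str, float]:
    """Return (family label, score) of the best-scoring member.

    Returns ("", 0.0) when no member shares a matching residue with the query.
    """
    best_family, best_score = "", 0.0
    for family, columns in members:
        score = induced_identity(query, columns)
        if score > best_score:
            best_family, best_score = family, score
    return best_family, best_score


if __name__ == "__main__":
    # Toy data: two made-up families aligned to a six-column profile.
    members = [
        ("famA", to_columns("ACDE-G")),
        ("famB", to_columns("ACWWWG")),
    ]
    print(assign_family(to_columns("ACDEFG"), members))
```

Because the query-to-profile alignment is reused rather than recomputed, a scheme of this shape adds only a cheap comparison per known member, which is consistent with the abstract's remark about low additional computational cost.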