Orengo C A, Michie A D, Jones S, Jones D T, Swindells M B, Thornton J M
Department of Biochemistry and Molecular Biology, University College London, UK.
Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8.
Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures, and in optimal cases function can also be assigned. The ever-increasing number of known protein structures is too large for all proteins to be classified manually; automatic methods are therefore needed for fast evaluation of protein structures.
We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies. When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily.
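The four-level hierarchy described above can be pictured as a nested key, where each deeper level refines the one before it. The following is only a minimal sketch under stated assumptions: the `DomainClassification` type, the `group_by_level` helper, the domain identifiers and all level labels other than "barrel" and "sandwich" are hypothetical placeholders, not the representation or naming used by the CATH database itself.

```python
from collections import defaultdict
from dataclasses import dataclass

LEVELS = "CATH"  # class, architecture, topology, homologous superfamily


@dataclass(frozen=True)
class DomainClassification:
    """One domain and its position in the hierarchy (hypothetical type)."""
    domain_id: str
    c: str  # class: secondary structure composition
    a: str  # architecture: overall shape, e.g. barrel or sandwich
    t: str  # topology: sequential connectivity
    h: str  # homologous superfamily: inferred evolutionary relationship

    def key(self, level: str) -> tuple:
        """Return the classification path from C down to the given level."""
        depth = LEVELS.index(level) + 1
        return (self.c, self.a, self.t, self.h)[:depth]


def group_by_level(domains, level):
    """Group domain identifiers that share the same path down to `level`."""
    groups = defaultdict(list)
    for d in domains:
        groups[d.key(level)].append(d.domain_id)
    return dict(groups)


# Toy data with placeholder labels (assumptions, not real CATH entries).
domains = [
    DomainClassification("dom1", "class-1", "sandwich", "fold-x", "sf-1"),
    DomainClassification("dom2", "class-1", "sandwich", "fold-x", "sf-2"),
    DomainClassification("dom3", "class-1", "sandwich", "fold-y", "sf-3"),
    DomainClassification("dom4", "class-2", "barrel", "fold-z", "sf-4"),
]

by_architecture = group_by_level(domains, "A")
by_topology = group_by_level(domains, "T")
```

In this toy set, the three sandwich domains share one architecture but split into two topologies, which mirrors the point that members of the same architecture might have quite different topologies.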
Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure-function/evolution relationships to both known and newly determined protein structures.
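A statistic such as the share of H-levels that fall within the most populated T-levels could be computed along the following lines. This is an illustrative sketch only: the function name `fraction_in_top_groups` and the toy mapping are hypothetical, and the numbers it produces are unrelated to the figures reported above.

```python
from collections import Counter


def fraction_in_top_groups(superfamily_to_topology, n):
    """Fraction of superfamilies that belong to the n most populated topologies."""
    if not superfamily_to_topology:
        raise ValueError("mapping must contain at least one superfamily")
    # Count how many superfamilies each topology contains.
    counts = Counter(superfamily_to_topology.values())
    # Sum the sizes of the n largest topologies.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(superfamily_to_topology)


# Toy mapping from superfamily to topology (placeholder labels, not real data).
toy = {
    "sf-1": "fold-x",
    "sf-2": "fold-x",
    "sf-3": "fold-x",
    "sf-4": "fold-y",
    "sf-5": "fold-z",
    "sf-6": "fold-w",
}

share_top1 = fraction_in_top_groups(toy, 1)
```

The same helper applies one level up, by passing a mapping from superfamily to architecture instead of topology.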