Laboratory of Molecular Biology, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
Proteins. 2011 Mar;79(3):853-66. doi: 10.1002/prot.22923. Epub 2010 Dec 22.
Domains are basic units of protein structure and are essential for exploring protein fold space and structure evolution. With the structural genomics initiative, the number of protein structures in the Protein Data Bank (PDB) is increasing dramatically, and domain assignments need to be made automatically. Most existing structural domain assignment programs define domains using the compactness of the domains and/or the number and strength of intra-domain versus inter-domain contacts. Here we present a different approach based on the recurrence of locally similar structural pieces (LSSPs) found by one-against-all structure comparisons with a dataset of 6373 protein chains from the PDB. Residues of the query protein are clustered using LSSPs via three different procedures to define domains. This approach gives results that are comparable to those of several existing programs that use geometrical and other structural information explicitly. Remarkably, most of the proteins that contribute the LSSPs defining a domain do not themselves contain the domain of interest. This study shows that domains can be defined by a collection of relatively small locally similar structural pieces containing, on average, four secondary structure elements. In addition, it indicates that domains are indeed made of recurrent small structural pieces that are used to build protein structures of many different folds, as suggested by recent studies.
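The abstract does not describe the three clustering procedures, so the following is only a minimal illustrative sketch of the general idea of grouping residues by LSSP recurrence, not the authors' method. It assumes each LSSP is given as a collection of query residue indices, links two residues when they are covered together by at least `min_shared` LSSPs, and reports the connected groups as candidate domains. The function name `cluster_residues`, the `min_shared` parameter, and the input representation are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cluster_residues(n_residues, lssps, min_shared=2):
    """Group residues that are covered together by >= min_shared LSSPs.

    Hypothetical sketch: `lssps` is a list of iterables of residue indices
    (0-based) in the query protein; this is an assumed input format.
    """
    # Count, for every residue pair, how many LSSPs cover both residues.
    shared = Counter()
    for piece in lssps:
        for i, j in combinations(sorted(set(piece)), 2):
            shared[(i, j)] += 1

    # Union-find over residues; merge pairs with enough shared LSSPs.
    parent = list(range(n_residues))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)

    # Collect residues by root; residues are visited in ascending order.
    groups = {}
    for r in range(n_residues):
        groups.setdefault(find(r), []).append(r)

    # Keep only multi-residue groups, ordered by their first residue.
    return sorted((g for g in groups.values() if len(g) > 1),
                  key=lambda g: g[0])


# Toy input: two regions each covered twice, plus one bridging piece
# that spans residues 4 and 5 only once.
toy_lssps = [range(0, 5), range(0, 5), range(5, 10), range(5, 10), range(4, 6)]
domains = cluster_residues(10, toy_lssps)
```

In this toy input the single bridging piece falls below the `min_shared` threshold, so the two regions stay separate. A real implementation would also need to handle residues covered by no LSSP and would likely weight pieces by their structural similarity, neither of which is modeled here.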