Fischer D, Tsai C J, Nussinov R, Wolfson H
Computer Science Department, School of Mathematical Sciences, Tel Aviv University, Israel.
Protein Eng. 1995 Oct;8(10):981-97. doi: 10.1093/protein/8.10.981.
Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions, a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology; and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins, or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D, sequence-independent, automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end, we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models.
The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another, or all the immunoglobulin-like folds in a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the "structural universe" is populated.
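To illustrate why clustering reduces the number of comparisons, the following is a minimal sketch of a single-pass greedy selection of representatives. It is not the authors' algorithm, which the abstract describes only as an iterative heuristic; the function names, the toy chain labels and the `same_fold` predicate are all hypothetical stand-ins, and a real predicate would wrap a structural comparison program.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def all_pairs(n: int) -> int:
    """Number of comparisons in an exhaustive all-against-all scheme."""
    return n * (n - 1) // 2


def select_representatives(
    chains: Sequence[str],
    is_similar: Callable[[str, str], bool],
) -> Tuple[List[str], Dict[str, List[str]], int]:
    """Greedy selection: compare each chain only with current representatives.

    Returns the representatives, the members of each cluster, and the
    number of pairwise comparisons actually performed.
    """
    representatives: List[str] = []
    members: Dict[str, List[str]] = {}
    comparisons = 0
    for chain in chains:
        for rep in representatives:
            comparisons += 1
            if is_similar(chain, rep):
                members[rep].append(chain)  # joins an existing cluster
                break
        else:
            # No representative matched: the chain founds a new cluster.
            representatives.append(chain)
            members[chain] = [chain]
    return representatives, members, comparisons


def same_fold(a: str, b: str) -> bool:
    """Hypothetical stand-in for a structural comparison (assumption):
    two toy chains are 'similar' if their labels share a fold prefix."""
    return a.split("_")[0] == b.split("_")[0]


toy_chains = ["globin_a", "globin_b", "protease_a", "ig_a", "globin_c", "protease_b"]
reps, clusters, used = select_representatives(toy_chains, same_fold)
print(reps, used, all_pairs(len(toy_chains)))
```

On this toy input the greedy pass performs 7 comparisons instead of the 15 needed for all pairs. For scale, an exhaustive comparison of 3250 chains would need 5,279,625 pairwise comparisons, which at roughly 2 s each comes to about 122 days of workstation time.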