Dengler U, Siddiqui A S, Barton G J
EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
Proteins. 2001 Feb 15;42(3):332-44.
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two snapshots of the 3Dee database, from September 1996 and November 1999, were analyzed. In the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains longer than 600 residues. This is in keeping with the idea that longer chains have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains longer than 200 residues have more than one domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a single protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than would be expected from random selection among all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB showed no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures had a fold that had not been seen previously.
A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.
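The headline counts for the November 1999 release can be turned into a few derived ratios. The short Python sketch below does only that arithmetic; the constant and function names are illustrative assumptions, not part of 3Dee. The domains-per-chain figure is a crude overall ratio, since the abstract notes that some domains span several chains.

```python
# Counts quoted in the abstract for the November 1999 release of 3Dee.
PDB_ENTRIES = 7_995
CHAINS = 13_767
DOMAINS = 18_896
SEQUENCE_FAMILIES = 1_715
STRUCTURE_FAMILIES = 1_199


def derived_ratios():
    """Return simple ratios computed from the quoted counts (illustrative only)."""
    return {
        # Overall ratios; not per-entry or per-chain distributions.
        "chains_per_entry": CHAINS / PDB_ENTRIES,
        "domains_per_chain": DOMAINS / CHAINS,
        "domains_per_sequence_family": DOMAINS / SEQUENCE_FAMILIES,
        # Fraction of sequence families that remain distinct after
        # clustering by fold.
        "structure_per_sequence_family": STRUCTURE_FAMILIES / SEQUENCE_FAMILIES,
        # "1 in 11" and "1 in 15" from the abstract, expressed as fractions.
        "novel_sequence_rate": 1 / 11,
        "novel_fold_rate": 1 / 15,
    }


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value:.3f}")
```

On these figures there are roughly 1.37 domains per chain overall, and about 70% as many structure families as sequence families. The two novelty rates correspond to about 9% of new domains and about 7% of new domain structures.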