Wolf Y I, Brenner S E, Bash P A, Koonin E V
National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA.
Genome Res. 1999 Jan;9(1):17-26.
A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. A collection of 1193 position-dependent weight matrices that can be used as fold identifiers (FIDs) was produced. In the completely sequenced genomes, folds could be automatically identified for 20%-30% of the proteins, with 3%-6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes. Within the bacteria, this distribution differs between parasitic and free-living species. In all analyzed genomes, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in eukaryotes, the second to fourth places belong to protein kinases, beta-propellers, and TIM-barrels. The observed diversity of protein folds in different proteomes is approximately twice as high as would be expected from a simple stochastic model describing a proteome as a finite sample from an infinite pool of proteins with an exponential distribution of the fold fractions. The distribution of the number of domains with different folds in one protein fits the geometric model, which is compatible with the evolution of multidomain proteins by random combination of domains. [Fold predictions for proteins from 14 proteomes are available on the World Wide Web at. The FIDs are available by anonymous ftp at the same location.]
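The two statistical models named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not confirm: "exponential distribution of the fold fractions" is read here as fold fractions that decay exponentially with fold rank, and the number of folds in the pool, the decay rate, the proteome size, and the geometric parameter `q` are hypothetical illustration values. All function names are invented for this sketch.

```python
import math


def fold_fractions(n_folds, decay):
    """Fold fractions decaying exponentially with rank (assumed reading
    of the model); normalized so that they sum to 1."""
    weights = [math.exp(-decay * rank) for rank in range(n_folds)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct_folds(fractions, n_proteins):
    """Expected number of different folds seen when n_proteins are drawn
    independently from an infinite pool with the given fold fractions.
    A fold with fraction p is missed by every draw with probability
    (1 - p) ** n_proteins."""
    return sum(1.0 - (1.0 - p) ** n_proteins for p in fractions)


def geometric_pmf(k, q):
    """Geometric model: probability that a protein carries exactly k
    domains with different folds (k >= 1), where q is the probability
    of adding one more domain."""
    return (1.0 - q) * q ** (k - 1)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    fractions = fold_fractions(n_folds=300, decay=0.02)
    print(expected_distinct_folds(fractions, n_proteins=1000))
    print([geometric_pmf(k, q=0.2) for k in range(1, 5)])
```

Under this reading, the abstract's diversity result would mean that the number of folds observed in a real proteome is roughly twice the value returned by `expected_distinct_folds` for a proteome of the same size. In the geometric model, each additional domain multiplies the probability by the constant factor `q`, which is the behaviour expected if domains are combined at random.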