The number of protein folds and their distribution over families in nature.

Authors

Liu Xinsheng, Fan Ke, Wang Wei

Affiliation

National Lab of Solid State Microstructure, Department of Physics and Institute of Biophysics, Nanjing University, Nanjing, China.

Publication information

Proteins. 2004 Feb 15;54(3):491-9. doi: 10.1002/prot.10514.

DOI:10.1002/prot.10514

PMID:14747997

Abstract

Currently, of the 10⁶ known protein sequences, only about 10⁴ structures have been solved. Based on homologies and similarities, proteins are grouped into different families in which each has a structural prototype, namely, the fold, and some share the same folds. However, the total number of folds and families, and furthermore, the distribution of folds over families in nature, are still an enigma. Here, we report a study on the distribution of folds over families and the total number of folds in nature, using a maximum probability principle and the moment method of estimation. A quadratic relation between the numbers of families and folds is found for the number of families in an interval from 6000 to 30,000. For example, about 2700 folds for 23,100 families are obtained, among them about 33 superfolds, including more than 100 families each, and the largest superfold comprises about 800 families. Our results suggest that although the majority of folds have only a single family per fold, a considerably larger number of folds include many more families each than in the database, and the distribution of folds over families in nature differs markedly from the sampled distribution. The long tail of fold distribution is first estimated in this article. The results fit the data for different versions of the structural classification of proteins (SCOP) excellently, and the goodness-of-fit tests strongly support the results. In addition, the method of directly "enlarging" the sample to the population may be useful in inferring distributions of species in different fields.

