Liu Jinfeng, Rost Burkhard
Department of Pharmacology, Columbia University, 630 West 168th Street, New York, NY 10032, USA.
Bioinformatics. 2002 Jul;18(7):922-33. doi: 10.1093/bioinformatics/18.7.922.
Structural genomics ultimately aims to determine structures for all proteins. Initially, however, experimentalists are likely to focus on globular proteins to achieve rapid basic coverage of protein sequence space. How many proteins will structural genomics have to target? How many proteins will be excluded, either because structural information is already available for them or because they are not globular? We had to answer these questions in the context of our target selection for the North-East Structural Genomics Consortium (NESG).
We estimated that structural information is available for about 6-38% of all proteins: 6% if we require high accuracy in comparative modelling, 38% if we are satisfied with a rough idea of the fold. Excluding all regions that are not globular, we found that structural genomics may have to target about 48% of all proteins. This corresponded to a similar percentage of the residues in the entire proteomes (52%). We explored a number of different strategies for clustering protein space in order to find the number of families representing this 48% of structurally unknown proteins. For the subset of all entirely sequenced eukaryotes, we found over 18,000 fragment clusters, each of which may be a suitable target for structural genomics.
All data are available from the authors; most results are summarized at: http://cubic.bioc.columbia.edu/genomes/RES/2002_bioinformatics/