Liu Jinfeng, Hegyi Hedi, Acton Thomas B, Montelione Gaetano T, Rost Burkhard
CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
Proteins. 2004 Aug 1;56(2):188-200. doi: 10.1002/prot.20012.
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. The initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify, on the basis of sequence homology, all protein domain families in the five eukaryotic target organisms that are currently entirely sequenced, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, used homology to PrISM, Pfam-A, and SWISS-PROT to chop the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared to be suitable targets; these were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
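The three aims of the target selection procedure can be illustrated with a minimal Python sketch. It is not the CHOP or NESG implementation: the function names, the fragment and family identifiers, the rule that a family is dropped as soon as any member can be modeled, and the choice of the alphabetically first member as representative are all assumptions made for illustration. The sketch takes the family assignment of each fragment as given, so the chopping and clustering steps themselves are not shown.

```python
from collections import defaultdict


def group_into_families(family_of):
    """Aim 1 (sketch): collect fragments by a precomputed family label."""
    families = defaultdict(list)
    for fragment, family in family_of.items():
        families[family].append(fragment)
    return dict(families)


def select_targets(family_of, modelable):
    """Return one representative fragment per family that still needs a structure.

    family_of: dict mapping fragment id -> family id (hypothetical input).
    modelable: set of fragment ids assumed to be modelable from the PDB.
    """
    targets = {}
    for family, members in group_into_families(family_of).items():
        # Aim 2 (assumed rule): discard the family if any member can be modeled.
        if any(member in modelable for member in members):
            continue
        # Aim 3 (assumed rule): take the alphabetically first member as representative.
        targets[family] = min(members)
    return targets


def count_singletons_and_clusters(family_of):
    """Count families with exactly one fragment and families with several."""
    sizes = [len(members) for members in group_into_families(family_of).values()]
    singletons = sum(1 for size in sizes if size == 1)
    return singletons, len(sizes) - singletons


# Toy data, invented for this example only.
family_of = {
    "frag_a1": "fam_A",
    "frag_a2": "fam_A",
    "frag_b1": "fam_B",
    "frag_c1": "fam_C",
    "frag_c2": "fam_C",
}
modelable = {"frag_c2"}

targets = select_targets(family_of, modelable)
singletons, clusters = count_singletons_and_clusters(family_of)
```

On the toy data, `fam_C` is dropped because one of its fragments is marked as modelable, leaving one target each for `fam_A` and `fam_B`. The same bookkeeping explains the abstract's estimate: each singleton and each multifragment cluster needs at least one structure, so the lower bounds of 27,000 singletons and 18,000 clusters already add up to 45,000 families, consistent with the stated >40,000 structures.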