Linial Michal
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
Comp Funct Genomics. 2003;4(5):542-8. doi: 10.1002/cfg.328.
Structural genomics strives to represent the entire protein space. The first step towards this goal is to rationally select proteins whose structures have not been determined but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins, which will help unveil the structural diversity of the protein space. Currently, no reliable method for accurate 3D structure prediction exists when neither a sequence homologue nor a structural homologue is available. Here we present a systematic methodology for selecting target proteins whose structures are likely to belong to a new, as yet unknown superfamily or fold. Our method takes advantage of the global classification of the sequence space provided by ProtoNet-3D, a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) together with all solved structures (taken from the PDB). By navigating the ProtoNet-3D scaffold, we obtain a prioritized list of proteins that are not yet structurally solved, along with the probability that each protein belongs to a new superfamily or fold. The sorted list has been validated against real structural data that were not available when the predictions were made. The practical application of our computational-statistical method to determining novel superfamilies for structural genomics projects is also discussed.
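The idea of navigating a clustering tree to prioritize unsolved proteins can be illustrated with a minimal sketch. This is not the ProtoNet-3D procedure and does not compute the probabilities reported in the abstract; it assumes a toy tree and a simple heuristic in which an unsolved protein ranks higher the more merges must be climbed before its cluster contains any solved structure. The `Cluster` class, `steps_to_solved`, `rank_targets`, `toy_tree` and the protein names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    """Node of a toy agglomerative clustering tree (hypothetical structure)."""
    name: str
    solved: bool = False  # meaningful for leaves only: True if a 3D structure exists
    children: list["Cluster"] = field(default_factory=list)
    parent: Optional["Cluster"] = field(default=None, repr=False, compare=False)

    def add(self, *kids: "Cluster") -> "Cluster":
        """Attach child clusters and record their parent link."""
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self

    def leaves(self):
        """Yield every leaf (protein) below this node."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


def steps_to_solved(leaf: Cluster) -> Optional[int]:
    """Count merges climbed until the enclosing cluster holds a solved leaf."""
    node, steps = leaf.parent, 1
    while node is not None:
        if any(member.solved for member in node.leaves()):
            return steps
        node, steps = node.parent, steps + 1
    return None  # no solved structure anywhere above this leaf


def rank_targets(root: Cluster) -> list[tuple[str, Optional[int]]]:
    """Sort unsolved proteins so the most isolated from solved ones come first."""
    scored = [(leaf.name, steps_to_solved(leaf))
              for leaf in root.leaves() if not leaf.solved]
    return sorted(
        scored,
        key=lambda item: float("inf") if item[1] is None else item[1],
        reverse=True,
    )


def toy_tree() -> Cluster:
    """Build a small illustrative tree; P1 is the only solved protein."""
    a = Cluster("A").add(Cluster("P1", solved=True), Cluster("P2"))
    c = Cluster("C").add(Cluster("P3"), Cluster("P4"))
    b = Cluster("B").add(c, Cluster("P5"))
    return Cluster("root").add(a, b)


if __name__ == "__main__":
    for name, steps in rank_targets(toy_tree()):
        print(name, steps)
```

In this toy tree, P3 and P4 merge with a solved structure only at the root, so they head the list, while P2 shares its first cluster with the solved P1 and ranks last. A real prioritization would need a statistically calibrated score rather than a raw merge count.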