Portugaly Elon, Kifer Ilona, Linial Michal
Institute of Computer Sciences and Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
Bioinformatics. 2002 Jul;18(7):899-907. doi: 10.1093/bioinformatics/18.7.899.
A major goal in structural genomics is to enrich the catalogue of proteins whose 3D structures are known. In an attempt to address this problem we mapped over 10 000 proteins with solved structures onto a graph of all Swissprot protein sequences (release 36, approximately 73 000 proteins) provided by ProtoMap, with the goal of sorting proteins according to their likelihood of belonging to new superfamilies. We hypothesized that proteins within neighbouring clusters tend to share common structural superfamilies or folds. If true, the likelihood of finding new superfamilies increases in clusters that are distal from other solved structures within the graph.
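As a rough illustration of what "distal from solved structures" could mean on a cluster graph, the sketch below computes, for every cluster, the number of hops to the nearest cluster containing a solved structure, and ranks clusters from farthest to nearest. This is a minimal sketch under stated assumptions, not the distance measure used by the authors: the abstract does not define it, so an unweighted multi-source breadth-first search stands in for it here. The adjacency mapping `neighbours`, the cluster labels and the function names are hypothetical.

```python
from collections import deque


def cluster_distances(neighbours, solved_clusters):
    """Hops from each reachable cluster to the nearest cluster holding a
    solved structure (multi-source BFS). Assumed distance, for illustration."""
    dist = {c: 0 for c in solved_clusters}
    queue = deque(solved_clusters)
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def rank_clusters(neighbours, solved_clusters):
    """Order clusters from most distal to least distal; clusters with no
    path to any solved structure are treated as infinitely far."""
    dist = cluster_distances(neighbours, solved_clusters)
    all_clusters = set(neighbours) | {n for ns in neighbours.values() for n in ns}
    return sorted(all_clusters, key=lambda c: (-dist.get(c, float("inf")), c))


# Hypothetical toy graph: A holds a solved structure, D is isolated.
toy_neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
print(rank_clusters(toy_neighbours, ["A"]))
```

Under this assumed distance, clusters at distance 0 share a cluster with a known structure, clusters at distance 1 neighbour such a cluster, and everything farther away would be the most promising source of new superfamilies.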
We defined an order relation between unsolved proteins according to their 'distance' from solved structures in the graph, and sorted approximately 48 000 proteins. Our list can be partitioned into three groups: approximately 35 000 proteins sharing a cluster with at least one known structure; approximately 6500 proteins in clusters with no solved structure but with neighbouring clusters containing known structures; and the remaining approximately 6100 proteins (in 1274 clusters). We tested the quality of the order relation using thousands of recently solved structures that were not included when the order was defined. The tests show that our order is significantly better (P-value approximately 10^(-5)) than a random order. More interestingly, the order within the union of the second and third groups, and the order within the third group alone, perform better than random (P-values: 0.0008 and 0.15, respectively) and are better than alternative orders created using PSI-BLAST. Herein, we present a method for selecting targets to be used in structural genomics projects.
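One way to compare a proposed order against a random order is a permutation test. The sketch below is a generic version under stated assumptions: the abstract does not say which test statistic was used, so the mean rank of later-confirmed "hits" (proteins that turned out to belong to new superfamilies) is assumed here, and the names `mean_hit_rank`, `permutation_p_value`, `order` and `hits` are hypothetical.

```python
import random


def mean_hit_rank(order, hits):
    """Average 0-based position of the hits within the ordered list."""
    ranks = [i for i, protein in enumerate(order) if protein in hits]
    if not ranks:
        raise ValueError("no hits found in the order")
    return sum(ranks) / len(ranks)


def permutation_p_value(order, hits, n_shuffles=10000, seed=0):
    """Estimate how often a random order places the hits at least as
    early as the proposed order does (lower mean rank is better)."""
    rng = random.Random(seed)
    observed = mean_hit_rank(order, hits)
    shuffled = list(order)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if mean_hit_rank(shuffled, hits) <= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_good + 1) / (n_shuffles + 1)


# Hypothetical example: 20 proteins, with the three hits ranked first.
order = [f"p{i}" for i in range(20)]
print(permutation_p_value(order, {"p0", "p1", "p2"}, n_shuffles=2000))
```

A small P-value from such a test indicates that a random order rarely ranks the hits as early as the proposed order does. The same procedure can be restricted to a subset of the list, such as the second and third groups only.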
A list of proteins for target selection, combined with a set of biological filters for narrowing down potential targets, is available at http://www.protarget.cs.huji.ac.il.