Babnigg György, Joachimiak Andrzej
Midwest Center for Structural Genomics, Biosciences Division, Argonne National Laboratory, 9700 S Cass Ave., Argonne, IL 60439, USA.
J Struct Funct Genomics. 2010 Mar;11(1):71-80. doi: 10.1007/s10969-010-9080-0. Epub 2010 Feb 23.
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for approximately 720 unique proteins that resulted in X-ray structures. The correlation of a protein's isoelectric point and grand average of hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have created applications that analyze query sequences, provide optimal boundary information for them, and visualize the data. These tools are available via the web site (http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor).
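The two sequence-derived properties named in the abstract, GRAVY and the isoelectric point, can be computed from a sequence alone. The following is a minimal sketch and not the authors' implementation: it assumes the standard Kyte-Doolittle hydropathy scale for GRAVY, and it estimates the isoelectric point by bisection on a Henderson-Hasselbalch net-charge model using an illustrative set of pKa values (published pKa sets differ, so the resulting pI is approximate). The function names `gravy`, `net_charge`, and `isoelectric_point` are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Illustrative pKa values (an assumption; other published sets differ).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERMINUS = 8.6
PKA_C_TERMINUS = 3.6


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def net_charge(sequence, ph):
    """Approximate net charge at a given pH (Henderson-Hasselbalch model)."""
    sequence = sequence.upper()
    # Terminal groups contribute one positive and one negative site.
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa in sequence:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(sequence, tolerance=1e-4):
    """Find the pH at which the net charge is zero, by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        # Net charge decreases monotonically with pH.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    print("GRAVY:", round(gravy(example), 3))
    print("pI (approx.):", round(isoelectric_point(example), 2))
```

Values such as these could serve as input features for a classifier like the SVM described above; which attributes the authors actually used, and how they were scaled, is not stated in the abstract.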