Jahandideh Samad, Jaroszewski Lukasz, Godzik Adam
Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA.
Acta Crystallogr D Biol Crystallogr. 2014 Mar;70(Pt 3):627-35. doi: 10.1107/S1399004713032070. Epub 2014 Feb 15.
Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of a protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472-2482] was developed. XtalPred classifies proteins into five 'crystallization classes' based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of further protein features is tested, such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface. The new XtalPred-RF (random forest) achieves a significant improvement in the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
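The two figures quoted at the end of the abstract (a 100% improvement being equivalent to twofold, and a 30% reduction in the number of targets) are simple ratios. The sketch below only illustrates that arithmetic. The function names and all input numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Return the percentage gain of new_rate over baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


def targets_after_reduction(n_targets: int, reduction_fraction: float) -> int:
    """Return how many targets remain after cutting the given fraction."""
    if not 0.0 <= reduction_fraction <= 1.0:
        raise ValueError("reduction_fraction must be between 0 and 1")
    return round(n_targets * (1.0 - reduction_fraction))


if __name__ == "__main__":
    # Hypothetical success rates, chosen only to illustrate the arithmetic:
    # doubling a success rate corresponds to a 100% relative improvement.
    baseline_rate = 0.10
    improved_rate = 0.20
    print(relative_improvement(baseline_rate, improved_rate))

    # Hypothetical pipeline size: a 30% reduction keeps 70% of the targets.
    print(targets_after_reduction(1000, 0.30))
```

Read this way, "100% (i.e. twofold)" means that the success rate in the top predicted class is about twice the rate it is compared with.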