Wang Huilin, Wang Mingjun, Tan Hao, Li Yuan, Zhang Ziding, Song Jiangning
National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.
Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia.
PLoS One. 2014 Aug 22;9(8):e105902. doi: 10.1371/journal.pone.0105902. eCollection 2014.
X-ray crystallography is the primary approach for solving the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of the multi-step experimental procedures (sequence cloning, protein material production, purification, crystallization and, ultimately, structure determination) to yield diffraction-quality crystals. Accordingly, predicting from the protein sequence the propensity of a protein to pass through these experimental procedures successfully may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge of the important determinants of the propensity of a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods perform worse when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date benchmark dataset and subsequently developed a new approach, termed 'PredPPCrys', using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance for the five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, the prediction outputs of PredPPCrys I were used as inputs to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. In addition, the predicted crystallization targets of currently non-crystallizable proteins are provided as compendium data, which are anticipated to facilitate target selection and design for structural genomics consortia worldwide. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
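The two-level design described in the abstract can be illustrated with a minimal sketch. This is not the PredPPCrys implementation: the amino-acid composition features, the small linear SVM trained by sub-gradient descent, the step names, the toy sequences and labels, and every function name below are assumptions made purely for illustration. The abstract states only that first-level SVM outputs feed second-level SVM classifiers for five experimental steps.

```python
# Illustrative two-level (stacked) SVM sketch; all names and data are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed labels for the five experimental steps named in the abstract.
STEPS = ["cloning", "production", "purification", "crystallization", "structure"]


def composition(seq):
    """Fraction of each of the 20 standard amino acids (an assumed feature set)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def decision_value(model, x):
    """Signed score w.x + b of a linear model."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM (hinge loss plus L2 penalty) fitted by sub-gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision_value((w, b), xi)
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_two_level(seqs, labels):
    """labels[step] holds one +1/-1 outcome per sequence."""
    X1 = [composition(s) for s in seqs]
    # Level 1: one classifier per step, trained on sequence features.
    level1 = {step: train_linear_svm(X1, labels[step]) for step in STEPS}
    # Level 2: each protein is re-described by its five level-1 scores.
    X2 = [[decision_value(level1[step], x) for step in STEPS] for x in X1]
    level2 = {step: train_linear_svm(X2, labels[step]) for step in STEPS}
    return level1, level2


def predict(level1, level2, seq):
    """Return a +1/-1 call for each step from the second-level classifiers."""
    x1 = composition(seq)
    x2 = [decision_value(level1[step], x1) for step in STEPS]
    return {step: 1 if decision_value(level2[step], x2) >= 0 else -1
            for step in STEPS}


# Toy data, invented for the example only.
seqs = ["MKKLLEEAAKK", "GGSGGSGGPPG", "MEELKKAVEEL", "PPGSPPGSGGS"]
labels = {step: [1, -1, 1, -1] for step in STEPS}
level1, level2 = train_two_level(seqs, labels)
result = predict(level1, level2, "MKELLKKAEEA")
print(result)
```

In a realistic setting the second-level inputs would normally come from cross-validated (out-of-fold) first-level predictions, so that the second level is not trained on scores the first level produced for its own training data; the sketch skips this for brevity.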