Kurgan Lukasz, Razib Ali A, Aghakhani Sara, Dick Scott, Mizianty Marcin, Jahandideh Samad
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.
BMC Struct Biol. 2009 Jul 31;9:50. doi: 10.1186/1472-6807-9-50.
Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality.
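The sequence-derived inputs named above can be illustrated with a minimal sketch. It is not the CRYSTALP2 implementation: the function names, the `gap` parameter, and the choice of the Kyte-Doolittle hydropathy scale are assumptions made for illustration, since the abstract does not say which hydrophobicity index or which collocation gaps are used. The isoelectric point is omitted because it depends on a chosen pKa table.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (assumed here; the abstract does not
# specify which hydrophobicity index CRYSTALP2 uses).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(seq):
    """Upper-case the sequence and drop non-standard residue codes."""
    return "".join(aa for aa in seq.upper() if aa in KYTE_DOOLITTLE)


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = clean(seq)
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero on empty input
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


def collocation(seq, gap=0):
    """Frequency of ordered residue pairs separated by `gap` residues.

    gap=0 gives adjacent pairs (dipeptides); gap=1 gives pairs with one
    intervening residue, and so on. `gap` is a hypothetical parameter.
    """
    seq = clean(seq)
    n_pairs = len(seq) - gap - 1
    pairs = Counter(
        seq[i] + seq[i + gap + 1] for i in range(max(n_pairs, 0))
    )
    total = max(n_pairs, 1)
    return {
        a + b: pairs.get(a + b, 0) / total
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over the sequence."""
    seq = clean(seq)
    if not seq:
        return 0.0
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Because every feature here is a frequency or an average, the resulting vector has the same length for any input sequence, which is one simple way a predictor can accept sequences of unrestricted size.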
A significant majority of the collocations used by CRYSTALP2 include residues with high conformational entropy, or with low entropy and a high potential to mediate crystal contacts; notably, such residues are utilized by surface entropy reduction methods. We show that the collocations provide information complementary to the hydrophobicity and isoelectric point. Tests on four datasets show that CRYSTALP2 outperforms several existing sequence-based predictors (CRYSTALP, OB-score, and SECRET). CRYSTALP2's accuracy ranges between 69.3 and 77.5%, its Matthews correlation coefficient (MCC) between 0.39 and 0.55, and its area under the ROC curve (AROC) between 0.72 and 0.79. Our predictions are similar in quality to, and complementary to, those of the most recent ParCrys and XtalPred methods. Our results also suggest that, as work in protein crystallization continues (thereby enlarging the population of proteins with known crystallization propensities), the prediction quality of the CRYSTALP2 method should increase. The prediction model and the datasets used in this contribution can be downloaded from http://biomine.ece.ualberta.ca/CRYSTALP2/CRYSTALP2.html.
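For readers unfamiliar with the three reported measures, the following sketch shows how accuracy, MCC, and the area under the ROC curve can be computed for a binary crystallizable (1) versus non-crystallizable (0) predictor. The function names and the label encoding are assumptions for illustration; this is not the evaluation code used in the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count as 0.5).
    Quadratic in the number of samples, which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

MCC is a useful companion to accuracy on datasets where crystallizable and non-crystallizable proteins are unevenly represented, because it stays near zero for a predictor that simply outputs the majority class.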
CRYSTALP2 provides relatively accurate crystallization propensity predictions for a given protein chain that either outperform or complement the existing approaches. The proposed method can be used to support current efforts towards improving the success rate in obtaining diffraction-quality crystals.