Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL, USA.
J Theor Biol. 2012 Aug 7;306:115-9. doi: 10.1016/j.jtbi.2012.04.028. Epub 2012 May 2.
Production of high-quality diffracting crystals is a critical step in determining the 3D structure of a protein by X-ray crystallography. Only 2%-10% of crystallization projects result in high-resolution protein structures. Several computational methods for predicting protein crystallizability have been developed previously. In this work, we introduce RFCRYS, a Random Forest based method for predicting the crystallizability of proteins. RFCRYS predicts crystallizability from the primary sequence, using two different databases and the following features: mono-, di-, and tri-peptide amino acid compositions, frequencies of amino acids in different physicochemical groups, isoelectric point, molecular weight, and sequence length. RFCRYS was compared with previous methods, and the results show that the proposed method, using this set of features, outperforms existing predictors with higher accuracy, Matthews correlation coefficient (MCC), and specificity. In particular, the method achieves a high specificity of 0.95, which means that RFCRYS rarely mispredicts a non-crystallizable protein chain as crystallizable; this can save time and resources. In conclusion, RFCRYS provides accurate crystallizability prediction for a protein chain and can be applied to support crystallization projects in achieving a higher success rate in obtaining diffraction-quality crystals.
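As an illustration of the sequence-derived features and evaluation measures named in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function names, the four-way physicochemical grouping in `GROUPS`, and the feature key names are assumptions made for this example; the abstract does not specify them. Isoelectric point and molecular weight are omitted for brevity, and the Random Forest classifier itself is not shown.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical physicochemical grouping, for illustration only;
# the grouping actually used by RFCRYS is not given in the abstract.
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def kmer_composition(seq, k):
    """Relative frequency of every possible k-peptide over the 20 standard residues."""
    total = len(seq) - k + 1  # number of overlapping windows of size k
    counts = Counter(seq[i:i + k] for i in range(max(total, 0)))
    return {
        "".join(p): (counts["".join(p)] / total if total > 0 else 0.0)
        for p in product(AMINO_ACIDS, repeat=k)
    }


def sequence_features(seq):
    """Length, mono-/di-/tri-peptide compositions, and group frequencies."""
    seq = seq.upper()
    features = {"length": float(len(seq))}
    for k in (1, 2, 3):
        features.update(kmer_composition(seq, k))
    for name, members in GROUPS.items():
        in_group = sum(residue in members for residue in seq)
        features["group_" + name] = in_group / len(seq) if seq else 0.0
    return features


def specificity_and_mcc(tp, tn, fp, fn):
    """Specificity and Matthews correlation coefficient from confusion-matrix counts."""
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return specificity, mcc
```

Under these assumptions, `sequence_features` returns one fixed-length dictionary per sequence (20 + 400 + 8000 composition entries plus length and group frequencies), which could be passed to any Random Forest implementation. Specificity is the fraction of non-crystallizable chains that are correctly rejected, so a value near 1 corresponds to few false "crystallizable" predictions.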