Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan.
PLoS One. 2013 Sep 3;8(9):e72368. doi: 10.1371/journal.pone.0072368. eCollection 2013.
Existing methods for predicting protein crystallization obtain high accuracy by using various types of complementary features and complex ensemble classifiers, such as support vector machine (SVM) and Random Forest classifiers. It is desirable to develop a simple and easily interpretable prediction method with informative sequence features to provide insights into protein crystallization. This study proposes an ensemble method, SCMCRYS, to predict protein crystallization, in which each classifier is built using a scoring card method (SCM) that estimates propensity scores of p-collocated amino acid (AA) pairs (p=0 for a dipeptide). The SCM classifier determines the crystallization of a sequence according to a weighted-sum score. The weights are the composition of the p-collocated AA pairs, and the propensity scores of these AA pairs are estimated using a statistical approach combined with optimization. SCMCRYS predicts crystallization by simple voting among a number of SCM classifiers. The experimental results show that a single SCM classifier using dipeptide composition, with an accuracy of 73.90%, is comparable to the best previously developed SVM-based classifier, SVM_POLY (74.6%), and to our proposed SVM-based classifier using the same dipeptide composition (77.55%). The SCMCRYS method, with an accuracy of 76.1%, is comparable to the state-of-the-art ensemble methods PPCpred (76.8%) and RFCRYS (80.0%), which use SVM and Random Forest classifiers, respectively. This study also investigates mutagenesis analysis based on SCM. The results suggest the hypothesis that mutagenesis of the surface residues Ala and Cys has a large and a small probability, respectively, of enhancing protein crystallizability, considering the estimated scores of crystallizability and solubility, melting point, molecular weight and conformational entropy of amino acids in a generalized condition.
The propensity scores of amino acids and dipeptides for estimating protein crystallizability can aid biologists in designing mutations of surface residues to enhance protein crystallizability. The source code of SCMCRYS is available at http://iclab.life.nctu.edu.tw/SCMCRYS/.
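The weighted-sum scoring and voting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the decision threshold, the strict-majority voting rule, and the reading of "p-collocated" as two residues separated by p intervening positions are all assumptions made for illustration. The propensity scores are passed in as a plain dictionary, and any values used with it here are hypothetical placeholders rather than scores from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_composition(seq, p=0):
    """Fraction of each p-collocated amino acid pair in seq.

    Assumption: a p-collocated pair is two residues with p residues
    between them, so p=0 gives ordinary dipeptides.
    """
    gap = p + 1
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - gap):
        pair = seq[i] + seq[i + gap]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def scm_score(seq, propensity, p=0):
    """Weighted sum: pair composition (weights) times propensity scores."""
    comp = pair_composition(seq, p)
    return sum(comp[pair] * propensity.get(pair, 0.0) for pair in comp)


def scm_predict(seq, propensity, threshold, p=0):
    """Hypothetical decision rule: crystallizable if score exceeds threshold."""
    return scm_score(seq, propensity, p) > threshold


def vote(seq, classifiers):
    """Strict-majority vote over (propensity, threshold, p) classifiers.

    The abstract only says "simple voting"; strict majority is an assumption.
    """
    votes = sum(scm_predict(seq, prop, thr, p) for prop, thr, p in classifiers)
    return votes * 2 > len(classifiers)
```

In this sketch each SCM classifier is fully described by one score table, one threshold and one value of p, which is what makes the method easy to inspect: the contribution of any AA pair to a prediction is its composition multiplied by its propensity score.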