Department of Biotechnology and Bioengineering, Huaqiao University, Xiamen 361021, Fujian, PR China.
Int J Biol Macromol. 2013 Feb;53:1-6. doi: 10.1016/j.ijbiomac.2012.10.031. Epub 2012 Nov 6.
Identifying the molecular features responsible for protein halophilicity is important for understanding the structural basis of protein halo-stability and could help in developing a practical strategy for designing halophilic proteins. In this work, we systematically analyzed the dipeptide composition of halophilic and non-halophilic protein sequences. We observed that halophilic proteins contained more DA, RA, AD, RR, AP, DD, PD, EA, VG and DV, at the expense of LK, IL, II, IA, KK, IS, KA, GK, RK and AI. We identified some macromolecular signatures of halo-adaptation and suggest that dipeptide composition may carry more information than amino acid composition. Based on the dipeptide composition, we developed what is, to our knowledge, the first machine learning method for classifying halophilic and non-halophilic proteins. The accuracy of our method was 100.0% on the training dataset and 93.1% in 10-fold cross-validation. We also discuss the influence of some specific dipeptides on prediction accuracy.
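The dipeptide composition described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the features were computed or normalized, so the code below is an assumption: it counts overlapping adjacent residue pairs over the 20 standard amino acids and divides by the number of valid pairs, giving a 400-element frequency vector. The names `AMINO_ACIDS`, `DIPEPTIDES` and `dipeptide_composition` are hypothetical and not taken from the paper.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence.

    Overlapping adjacent pairs are counted; pairs containing a
    non-standard residue (e.g. "X") are skipped. Frequencies are
    normalized by the number of counted pairs (an assumption, since
    the abstract does not state the normalization).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or no valid pairs: return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    features = dipeptide_composition("DADA")
    by_name = dict(zip(DIPEPTIDES, features))
    print(len(features), by_name["DA"], by_name["AD"])
```

A vector of this kind, computed for each sequence, is the sort of input a classifier could be trained on; the specific learning algorithm used in the paper is not named in the abstract, so none is sketched here.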