Niu Xiao-Hui, Hu Xue-Hai, Shi Feng, Xia Jing-Bo
College of Science, Huazhong Agricultural University, Wuhan, P.R. China.
Protein Pept Lett. 2012 Sep;19(9):940-8. doi: 10.2174/092986612802084492.
Obtaining soluble proteins in sufficient concentrations is a major obstacle in many experimental studies. Predicting whether targets in large-scale proteomics projects will be soluble is an important but still unresolved scientific problem. Chaos game representation (CGR) can expose patterns hidden in protein sequences and can visually reveal previously unknown structure. Fractal dimensions are good tools for measuring the size of complex, highly irregular geometric objects. In this paper, we convert each protein sequence into a high-dimensional vector using the CGR algorithm and fractal dimensions, and then predict protein solubility from these fractal features combined with Chou's pseudo amino acid composition features, using a support vector machine (SVM). We extract and study six groups of features computed directly from the primary sequence, and evaluate each group by 10-fold cross-validation. In the comparison, the group of 445-dimensional vectors performs best, with an average accuracy of 0.8741 and an average Matthews correlation coefficient (MCC) of 0.7358. The resulting predictor is also compared with existing methods and shows significant improvement.
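To make the CGR and fractal-dimension step concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common protein CGR layout (the 20 amino acids placed on the vertices of a regular 20-gon, with each point moving halfway toward the vertex of the next residue) and estimates a box-counting dimension by least squares. The names `cgr_points` and `box_counting_dimension`, the contraction ratio, and the grid sizes are illustrative choices; the abstract does not specify any of them.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: each residue is assigned a vertex of a regular 20-gon
# on the unit circle. The paper may use a different layout.
VERTICES = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}


def cgr_points(sequence, ratio=0.5):
    """Return the CGR trajectory of a protein sequence.

    Starting at the origin, each step moves the current point a fraction
    `ratio` of the way toward the vertex of the next residue.
    """
    x, y = 0.0, 0.0
    points = []
    for aa in sequence.upper():
        if aa not in VERTICES:
            continue  # skip non-standard or ambiguous residues such as X
        vx, vy = VERTICES[aa]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points


def box_counting_dimension(points, grid_sizes=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of a point set in [-1, 1]^2.

    For each k, the square is split into k x k boxes and the occupied
    boxes N(k) are counted. The estimate is the least-squares slope of
    log N(k) against log k.
    """
    if not points:
        raise ValueError("no points to measure")
    log_k, log_n = [], []
    for k in grid_sizes:
        boxes = set()
        for px, py in points:
            i = min(int((px + 1) / 2 * k), k - 1)
            j = min(int((py + 1) / 2 * k), k - 1)
            boxes.add((i, j))
        log_k.append(math.log(k))
        log_n.append(math.log(len(boxes)))
    n = len(log_k)
    mean_k = sum(log_k) / n
    mean_n = sum(log_n) / n
    sxx = sum((a - mean_k) ** 2 for a in log_k)
    sxy = sum((a - mean_k) * (b - mean_n) for a, b in zip(log_k, log_n))
    return sxy / sxx


if __name__ == "__main__":
    pts = cgr_points("MKTAYIAKQRQISFVKSHFSRQ")  # arbitrary example sequence
    print(len(pts), box_counting_dimension(pts))
```

In a pipeline like the one described, a value such as this would be one entry of the feature vector, concatenated with pseudo amino acid composition features before being passed to an SVM. Short sequences give few points, so the estimate is only meaningful for reasonably long proteins.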