IEEE Trans Nanobioscience. 2015 Jun;14(4):339-349. doi: 10.1109/TNB.2014.2352454. Epub 2014 Sep 15.
Information about protein structural classes benefits secondary and tertiary structure prediction, protein fold prediction, and protein function analysis, so predicting structural classes is an important task. In recent years, several computational methods have been developed for predicting the structural classes of proteins with low sequence similarity (25%-40%). However, the reported prediction accuracies remain unsatisfactory. To improve them, we propose three different feature extraction methods and construct a comprehensive feature set that captures both sequence and structure information. By applying a random forest (RF) classifier to this feature set, we develop a novel method for structural class prediction. We test the proposed method on three low-sequence-similarity benchmark datasets (25PDB, 640, and 1189) and obtain overall prediction accuracies of 93.5%, 92.6%, and 93.4%, respectively. Compared with six competing methods, these accuracies are 3.4%, 6.2%, and 8.7% higher than those achieved by the best-performing competing methods, demonstrating the advantage of our method. Moreover, because the three benchmark datasets are limited in size, we further test the proposed method on three updated large-scale datasets with different sequence similarities (40%, 30%, and 25%). The proposed method achieves accuracies above 90% on all three of these datasets, consistent with its accuracies on the three benchmark datasets. The experimental results suggest that our method is an effective and promising tool for structural class prediction. A web server that implements the proposed method is currently available at http://121.192.180.204:8080/RF_PSCP/Index.html.
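The overall prediction accuracy reported above is the fraction of proteins assigned to their correct structural class. A minimal sketch of that computation follows, assuming the four classes conventionally used with these benchmarks (all-alpha, all-beta, alpha/beta, alpha+beta). The function names and the toy labels are hypothetical and chosen for illustration only; they are not taken from the paper or its web server.

```python
from collections import Counter

# Four structural classes conventionally used with 25PDB, 640, and 1189
# (an assumption here; the abstract itself does not list them).
CLASSES = ("all-alpha", "all-beta", "alpha/beta", "alpha+beta")


def overall_accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("no samples to evaluate")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Correct predictions divided by true members, for each class present."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in CLASSES if totals[c]}


# Toy labels invented for illustration (not data from the paper).
y_true = ["all-alpha", "all-alpha", "all-beta", "all-beta",
          "alpha/beta", "alpha/beta", "alpha+beta", "alpha+beta"]
y_pred = ["all-alpha", "all-alpha", "all-beta", "alpha+beta",
          "alpha/beta", "alpha/beta", "alpha+beta", "all-beta"]

print(f"overall accuracy: {overall_accuracy(y_true, y_pred):.3f}")
for name, acc in per_class_accuracy(y_true, y_pred).items():
    print(f"{name}: {acc:.2f}")
```

Reporting per-class values alongside the overall figure is a common design choice when class sizes differ, since a single overall number can hide weak performance on a smaller class.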