Xia Jiaqi, Peng Zhenling, Qi Dawei, Mu Hongbo, Yang Jianyi
Department of Physics, Northeast Forestry University, Harbin, China.
Center for Applied Mathematics, Tianjin University, Tianjin, China.
Bioinformatics. 2017 Mar 15;33(6):863-870. doi: 10.1093/bioinformatics/btw768.
Protein fold classification is a critical step in protein structure prediction. There are two ways to classify protein folds: template-based fold assignment and ab-initio prediction using machine learning algorithms. Combining the two approaches to improve prediction accuracy has not been explored before.
We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm that uses the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm in which a comprehensive set of features is extracted from three complementary sequence profiles. The two algorithms are then combined, resulting in the ensemble approach TA-fold. We assessed the proposed methods by comparing them with ab-initio methods and template-based threading methods on six benchmark datasets. TA-fold achieved an accuracy of 0.799 on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4-11.7% over ab-initio methods. After this dataset was updated to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolutionary information.
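The abstract does not state how HH-fold and SVM-fold are combined. One plausible rule is a confidence-gated fallback: trust the template-based assignment when its top hit is confident enough, and otherwise use the ab-initio prediction. The sketch below illustrates that idea only. The names `TemplateHit` and `combine_predictions`, the `min_probability` parameter and its default value are hypothetical and are not taken from the TA-fold implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemplateHit:
    """Top template hit for a query (hypothetical container, not TA-fold's)."""
    fold: str           # fold label of the matched template
    probability: float  # hit confidence in [0, 1]


def combine_predictions(
    template_hit: Optional[TemplateHit],
    svm_fold: str,
    min_probability: float = 0.5,  # assumed cut-off, not from the paper
) -> str:
    """Return the template fold if the hit is confident, else the SVM fold."""
    if template_hit is not None and template_hit.probability >= min_probability:
        return template_hit.fold
    return svm_fold


# Usage: a confident template hit wins; a weak or missing hit falls back.
confident = combine_predictions(TemplateHit("a.1", 0.95), svm_fold="b.1")
weak = combine_predictions(TemplateHit("a.1", 0.10), svm_fold="b.1")
missing = combine_predictions(None, svm_fold="b.1")
```

Under this assumed rule, the cut-off trades the precision of template matches against the coverage of the ab-initio classifier, so it would need to be tuned on held-out data.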
http://yanglab.nankai.edu.cn/TA-fold/.
yangjy@nankai.edu.cn or mhb-506@163.com.
Supplementary data are available at Bioinformatics online.