Wu Sitao, Zhang Yang
Center for Bioinformatics and Department of Molecular Bioscience, University of Kansas, 2030 Becker Dr, Lawrence, KS 66047, USA.
Bioinformatics. 2008 Apr 1;24(7):924-31. doi: 10.1093/bioinformatics/btn069. Epub 2008 Feb 22.
Pairwise residue-residue contacts in proteins can be predicted both from threading templates and by sequence-based machine learning. However, most structure modeling approaches use only the template-based contact predictions to guide their simulations, partly because sequence-based contact predictions are usually considered less accurate than those from threading. Given the rapid progress in sequence databases and machine-learning techniques, a detailed and comprehensive assessment of contact-prediction methods under different template conditions is needed.
We develop two methods for protein contact prediction: SVM-SEQ is a sequence-based machine-learning approach that is trained on contact maps using a variety of sequence-derived features; SVM-LOMETS collects consensus contact predictions from multiple threading templates. We test both methods on the same set of 554 proteins, which are categorized as 'Easy', 'Medium', 'Hard' and 'Very Hard' targets based on the evolutionary and structural distance between templates and targets. For the Easy and Medium targets, SVM-LOMETS clearly outperforms SVM-SEQ; for the Hard and Very Hard targets, however, the accuracy of the SVM-SEQ predictions is higher than that of SVM-LOMETS by 12-25%. Combining the SVM-SEQ and SVM-LOMETS predictions increases the total number of correctly predicted contacts for the Hard proteins by more than 60% (or 70% for long-range contacts with a sequence separation ≥ 24), compared with SVM-LOMETS alone. The advantage of SVM-SEQ is also seen in the CASP7 free modeling targets, where SVM-SEQ is about four times more accurate than SVM-LOMETS in long-range contact prediction. These data demonstrate that state-of-the-art sequence-based contact prediction has reached a level at which it may help tertiary structure modeling for targets that lack close structural templates. The maximum yield should be obtained by combining sequence- and template-based predictions.
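The quantities compared in the abstract (prediction accuracy, the number of correctly predicted contacts, and the long-range subset with sequence separation ≥ 24) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, accuracy is taken to be the fraction of predicted contacts that appear in the native contact set, the combination is a plain set union (the abstract does not say how the two predictors are actually merged), and the residue pairs are toy data, not results from the paper.

```python
from typing import Iterable, Set, Tuple

Contact = Tuple[int, int]  # a pair of residue indices (i, j)


def normalize(contacts: Iterable[Contact]) -> Set[Contact]:
    """Store each contact as (smaller index, larger index), dropping i == j."""
    return {(min(i, j), max(i, j)) for i, j in contacts if i != j}


def long_range(contacts: Iterable[Contact], min_sep: int = 24) -> Set[Contact]:
    """Keep contacts whose sequence separation |i - j| is at least min_sep."""
    return {(i, j) for i, j in normalize(contacts) if j - i >= min_sep}


def count_correct(predicted: Iterable[Contact], native: Iterable[Contact]) -> int:
    """Number of predicted contacts that are present in the native set."""
    return len(normalize(predicted) & normalize(native))


def accuracy(predicted: Iterable[Contact], native: Iterable[Contact]) -> float:
    """Fraction of predicted contacts that are correct (assumed definition)."""
    pred = normalize(predicted)
    if not pred:
        return 0.0
    return count_correct(pred, native) / len(pred)


def combine(a: Iterable[Contact], b: Iterable[Contact]) -> Set[Contact]:
    """Assumed combination rule: the union of both prediction sets."""
    return normalize(a) | normalize(b)


# Toy data only; these pairs are hypothetical, not taken from the study.
native = {(1, 30), (5, 40), (10, 12), (3, 50)}
seq_pred = {(1, 30), (5, 40), (7, 9)}        # stands in for a sequence-based predictor
thread_pred = {(10, 12), (30, 1), (2, 60)}   # stands in for a template-based predictor

combined = combine(seq_pred, thread_pred)
print("correct, template-based alone:", count_correct(thread_pred, native))
print("correct, combined:", count_correct(combined, native))
print("long-range contacts in combined set:", sorted(long_range(combined)))
```

With a union rule, the number of correct contacts can only stay the same or grow, while accuracy may drop because false positives from both predictors are pooled; a real pipeline would likely weight or filter the two sources by confidence.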