Structural protein descriptors in 1-dimension and their sequence-based predictions.

Affiliation

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada.

Publication information

Curr Protein Pept Sci. 2011 Sep;12(6):470-89. doi: 10.2174/138920311796957711.

Abstract

The last few decades have seen increasing interest in the development and application of 1-dimensional (1D) descriptors of protein structure. These descriptors project 3D structural features onto 1D strings of residue-wise structural assignments. They cover a wide range of structural aspects, including conformation of the backbone, burying depth/solvent exposure and flexibility of residues, and inter-chain residue-residue contacts. We perform a first-of-its-kind comprehensive comparative review of the existing 1D structural descriptors. We define, review, and categorize ten structural descriptors, and we describe, summarize, and contrast over eighty computational models that are used to predict these descriptors from protein sequences. We show that the majority of recent sequence-based predictors use machine learning models, the most popular being neural networks, support vector machines, hidden Markov models, and support vector and linear regressions. These methods provide high-throughput predictions, and most of them are accessible to non-expert users via web servers and/or stand-alone software packages. We empirically evaluate several recent sequence-based predictors of secondary structure, disorder, and solvent accessibility descriptors using a benchmark set based on CASP8 targets. Our analysis shows that secondary structure can be predicted with over 80% accuracy and segment overlap (SOV); disorder with over 0.9 AUC, 0.6 Matthews Correlation Coefficient (MCC), and 75% SOV; and relative solvent accessibility with a PCC of 0.7 and an MCC of 0.6 (0.86 when homology is used). We demonstrate that the secondary structure predicted from sequence without the use of homology modeling is as good as the structure extracted from the 3D folds predicted by top-performing template-based methods.
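
As an illustration of two of the residue-wise measures named in the abstract, the following is a minimal sketch of per-residue accuracy and the Matthews Correlation Coefficient computed over 1D descriptor strings. It is not the evaluation code used in the paper: the function names, the three-state alphabet (H, E, C), the disorder labels (D, O), and the toy strings are all assumptions made for this example.

```python
from math import sqrt


def per_residue_accuracy(observed: str, predicted: str) -> float:
    """Fraction of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


def matthews_cc(observed: str, predicted: str, positive: str) -> float:
    """Binary MCC, treating `positive` as the positive class and all else as negative."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    tp = fp = tn = fn = 0
    for o, p in zip(observed, predicted):
        if p == positive:
            if o == positive:
                tp += 1
            else:
                fp += 1
        else:
            if o == positive:
                fn += 1
            else:
                tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: 3-state secondary structure (H = helix, E = strand, C = coil).
ss_observed = "CCHHHHCCEEEC"
ss_predicted = "CCHHHCCCEEEC"

# Hypothetical toy data: disorder (D = disordered, O = ordered).
dis_observed = "DDDOOOOODD"
dis_predicted = "DDOOOOOODD"

if __name__ == "__main__":
    print(f"accuracy = {per_residue_accuracy(ss_observed, ss_predicted):.3f}")
    print(f"MCC      = {matthews_cc(dis_observed, dis_predicted, 'D'):.3f}")
```

Both functions compare strings position by position, which is what makes 1D descriptors convenient to benchmark. Segment-based measures such as SOV need the start and end of each structural segment and are not covered by this sketch.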
