Ouzounis C, Sander C, Scharf M, Schneider R
Protein Design Group, EMBL, Heidelberg, Germany.
J Mol Biol. 1993 Aug 5;232(3):805-25. doi: 10.1006/jmbi.1993.1433.
The problem of protein structure prediction is formulated here as that of evaluating how well an amino acid sequence fits a hypothetical structure. The simplest and most complicated approaches, secondary structure prediction and all-atom free energy calculations, can be viewed as sequence-structure fitness problems. Here, an approach of intermediate complexity is described, which involves: (1) description of a protein structure in terms of contact interface vectors, with both intra-protein and protein-solvent contacts counted; (2) derivation of sequence preferences for 2 to 29 contact interface types; (3) generation of numerous hypothetical model structures by placing the input sequence into a large set of known three-dimensional structures in all possible alignments; (4) evaluation of these models by summing the sequence preferences over all structural positions; and (5) choice of the predicted three-dimensional structure as that with the best sequence-structure fitness. Evolutionary information is incorporated by using position-dependent core weights derived from multiple sequence alignments. A number of tests of the method are performed: (1) evaluation of cyclic shifts of a sequence in its native structure; (2) alignment of a sequence in its native structure, allowing gaps; (3) alignment search with a sequence or sequence fragment in a database of structures; and (4) alignment search with a structure in a database of sequences.
The main results are: (1) a native sequence can very well find its native structure among a large number of alternatives, in correct alignment; (2) substructures, such as (beta alpha)n units, can be detected in spite of very low sequence similarity; (3) remote homologues can be detected, with some dependence on the set of parameters used; (4) contact interface parameters are clearly superior to classical secondary structure parameters; (5) a simple interface description in terms of just two states, protein-protein and protein-water contacts, performs surprisingly well; (6) the use of core weights considerably improves accuracy in detection of remote homologues; (7) based on a sequence database search with a myoglobin contact profile, the C-terminal domain of a viral origin of replication binding protein is predicted to have an all-helical fold. The sequence-structure fitness concept is sufficiently general to accommodate a large variety of protein structure prediction methods, including new models of intermediate complexity currently being developed.
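The scoring scheme summarized in the abstract can be illustrated with a minimal Python sketch of the two-state case (protein-protein versus protein-water contacts). Everything concrete here is an assumption made for illustration: the preference values in PREFERENCE, the one-letter state codes "B" (buried) and "E" (exposed), the function names, and the restriction to ungapped placements. It is not the published parameter set or the authors' implementation, and gapped alignment is not shown.

```python
# Hypothetical two-state preference table (illustrative values only):
# score for placing a residue at a buried ("B", protein-protein contact)
# or exposed ("E", protein-water contact) structural position.
PREFERENCE = {
    "L": {"B": 1.0, "E": -1.0},
    "V": {"B": 0.8, "E": -0.8},
    "K": {"B": -1.0, "E": 1.0},
    "D": {"B": -0.9, "E": 0.9},
    "G": {"B": 0.0, "E": 0.0},
}


def fitness(sequence, states, weights=None):
    """Sum sequence preferences over all structural positions.

    `weights` stands in for position-dependent core weights; it defaults
    to 1.0 at every position (an assumption of this sketch).
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and structure must have equal length")
    if weights is None:
        weights = [1.0] * len(states)
    return sum(w * PREFERENCE[aa][s]
               for aa, s, w in zip(sequence, states, weights))


def best_ungapped_placement(sequence, states):
    """Slide the sequence along a structure; return (score, offset)."""
    n, m = len(sequence), len(states)
    if n > m:
        raise ValueError("sequence longer than structure")
    best = None
    for offset in range(m - n + 1):
        score = fitness(sequence, states[offset:offset + n])
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def predict_fold(sequence, library):
    """Pick the library structure with the best sequence-structure fitness.

    `library` maps a structure name to its string of interface states.
    Returns (name, offset, score).
    """
    best = None
    for name, states in library.items():
        if len(sequence) > len(states):
            continue  # this toy version cannot place longer sequences
        score, offset = best_ungapped_placement(sequence, states)
        if best is None or score > best[2]:
            best = (name, offset, score)
    return best


def cyclic_shift_scores(sequence, states):
    """Score every cyclic shift of a sequence in one structure."""
    return [fitness(sequence[k:] + sequence[:k], states)
            for k in range(len(sequence))]


if __name__ == "__main__":
    library = {"foldA": "EEBEBEEE", "foldB": "EEEEEEEE"}
    print(predict_fold("LKLK", library))
    print(cyclic_shift_scores("LKLKK", "BEBEE"))
```

In this toy setting, the cyclic-shift function mirrors the first test listed in the abstract: a sequence that matches its structure should score highest at shift 0. Supplying a `weights` list to `fitness` is one simple way to let conserved core positions count more, in the spirit of the core weights described above.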