Pei Jimin, Grishin Nick V
Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas 75390-9050, USA.
Proteins. 2004 Sep 1;56(4):782-94. doi: 10.1002/prot.20158.
We study the effects of various factors in representing and combining evolutionary and structural information for local protein structure prediction based on fragment selection. We prepare databases of fragments from a set of non-redundant protein domains. For each fragment, evolutionary information is derived from homologous sequences and represented as estimated effective counts and frequencies of amino acids (evolutionary frequencies) at each position. Position-specific amino acid preferences, called structural frequencies, are derived from statistical analysis of discrete local structural environments in database structures. Our method for local structure prediction ranks database fragments by their similarity to a target fragment and selects the most similar ones. Using secondary structure type as the local structural property, we test our method in a number of settings. The major findings are: (1) The COMPASS-type scoring function for fragment similarity comparison gives better prediction accuracy than the three other profile-profile scoring functions tested. We show that the COMPASS-type scoring function can be derived both in the probabilistic framework and in the framework of statistical potentials. (2) Using the evolutionary frequencies of database fragments gives better prediction accuracy than using structural frequencies. (3) With structural frequencies, progressively finer definitions of local environments, such as including more side-chain solvent accessibility classes and considering the backbone conformations of neighboring residues, give progressively better prediction accuracy. (4) Combining evolutionary and structural frequencies of database fragments, either linearly or with a pseudocount mixture formula, improves prediction accuracy. Combination at the log-odds score level is not as effective as combination at the frequency level. This suggests that there might be better ways of combining sequence and structural information than the commonly used linear combination of log-odds scores. Our method of fragment selection and frequency combination gives reasonable secondary structure prediction results on 56 CASP5 targets (average SOV score 0.77), suggesting that it is a valid method for local protein structure prediction. Mixing predicted structural frequencies with evolutionary frequencies improves the quality of local profile-to-profile alignment by COMPASS.
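To make finding (4) concrete, the sketch below illustrates the two frequency-level combinations named in the abstract, a linear mix and a pseudocount mixture, followed by a simple count-weighted log-odds score for ranking database fragments. The abstract does not give the formulas, so everything here is an assumption: the function names, the weight `w`, the pseudocount weight `beta`, the toy three-letter alphabet, and the scoring form are illustrative only and are not the authors' implementation or the exact COMPASS-type function.

```python
import math

def linear_mix(f_evo, f_struct, w):
    """Linear combination of two frequency vectors.

    w is a hypothetical weight on the structural frequencies (0 <= w <= 1).
    """
    return [(1.0 - w) * e + w * s for e, s in zip(f_evo, f_struct)]

def pseudocount_mix(f_evo, f_struct, n_eff, beta):
    """Pseudocount-style mixture (assumed form).

    The structural frequencies act as beta pseudocounts added to
    n_eff effective observed counts with frequencies f_evo.
    """
    return [(n_eff * e + beta * s) / (n_eff + beta)
            for e, s in zip(f_evo, f_struct)]

def log_odds_score(target_counts, db_freqs, background):
    """Count-weighted log-odds score for one position (assumed form).

    Sums target_count * ln(db_freq / background) over residue types
    that are actually observed in the target.
    """
    return sum(c * math.log(f / q)
               for c, f, q in zip(target_counts, db_freqs, background)
               if c > 0)

# Toy three-letter alphabet; all numbers are made up for illustration.
background = [0.4, 0.3, 0.3]
f_evo = [0.7, 0.3, 0.0]        # evolutionary frequencies of a database position
f_struct = [0.5, 0.25, 0.25]   # structural frequencies of the same position

mixed_linear = linear_mix(f_evo, f_struct, w=0.5)
mixed_pseudo = pseudocount_mix(f_evo, f_struct, n_eff=3.0, beta=1.0)

# Rank two database positions against one target position.
target_counts = [2.0, 1.0, 0.0]
score_similar = log_odds_score(target_counts, mixed_pseudo, background)
score_dissimilar = log_odds_score(target_counts, [0.1, 0.2, 0.7], background)
```

Mixing at the frequency level, as here, happens before the logarithm is taken, so a residue type with zero evolutionary frequency still receives a nonzero mixed frequency from the structural term. Combining at the log-odds level would instead add two separately computed scores, which is the alternative the abstract reports as less effective.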