Stultz C M, White J V, Smith T F
Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, Massachusetts 02138.
Protein Sci. 1993 Mar;2(3):305-14. doi: 10.1002/pro.5560020302.
A new method has been developed to compute the probability that each amino acid in a protein sequence is in a particular secondary structural element. Each of these probabilities is computed using the entire sequence and a set of predefined structural class models. This set of structural classes is patterned after Jane Richardson's taxonomy for the domains of globular proteins. For each structural class considered, a mathematical model is constructed to represent constraints on the pattern of secondary structural elements characteristic of that class. These are stochastic models having discrete state spaces (referred to as hidden Markov models by researchers in signal processing and automatic speech recognition). Each model is a mathematical generator of amino acid sequences; the sequence under consideration is modeled as having been generated by one model in the set of candidates. The probability that each model generated the given sequence is computed using a filtering algorithm. The protein is then classified as belonging to the structural class having the most probable model. The secondary structure of the sequence is then analyzed using a "smoothing" algorithm that is optimal for that structural class model. For each residue position in the sequence, the smoother computes the probability that the residue is contained within each of the defined secondary structural elements of the model. This method has two important advantages: (1) the probability of each residue being in each of the modeled secondary structural elements is computed using the totality of the amino acid sequence, and (2) these probabilities are consistent with prior knowledge of realizable domain folds as encoded in each model. As an example of the method's utility, we present its application to flavodoxin, a prototypical alpha/beta protein having a central beta-sheet, and to thioredoxin, which belongs to a similar structural class but shares no significant sequence similarity.
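The pipeline described in the abstract (score each structural class model by filtering, pick the most probable class, then smooth to get per-residue probabilities) can be illustrated with a minimal sketch. It uses the standard scaled forward recursion for the filtering step and the standard forward-backward recursion for the smoothing step; the abstract does not spell out the authors' exact algorithms, so treat these as stand-ins. Everything concrete here is an assumption made for illustration: the two toy models `all_alpha` and `alpha_beta`, their state sets, all probabilities, the uniform prior over models, and the three-letter residue alphabet (`h` hydrophobic, `p` polar, `g` glycine/proline-like) that replaces the 20 amino acids.

```python
import math

# Hypothetical emission tables over a reduced 3-class residue alphabet.
# All numbers are invented for illustration, not taken from the paper.
HELIX = {"h": 0.5, "p": 0.4, "g": 0.1}
STRAND = {"h": 0.7, "p": 0.25, "g": 0.05}
LOOP = {"h": 0.2, "p": 0.4, "g": 0.4}

# Hypothetical structural class models: each is a small discrete-state HMM.
MODELS = {
    "all_alpha": {
        "states": ["helix", "loop"],
        "init": [0.5, 0.5],
        "trans": [[0.9, 0.1], [0.2, 0.8]],
        "emit": [HELIX, LOOP],
    },
    "alpha_beta": {
        "states": ["helix", "strand", "loop"],
        "init": [0.3, 0.3, 0.4],
        "trans": [[0.85, 0.0, 0.15], [0.0, 0.8, 0.2], [0.15, 0.15, 0.7]],
        "emit": [HELIX, STRAND, LOOP],
    },
}


def forward(seq, model):
    """Scaled forward (filtering) pass.

    Returns the filtered state probabilities at each position, the
    per-position scale factors, and log P(seq | model).
    """
    init, trans, emit = model["init"], model["trans"], model["emit"]
    n = len(init)
    alphas, scales = [], []
    for t, sym in enumerate(seq):
        if t == 0:
            a = [init[j] * emit[j][sym] for j in range(n)]
        else:
            prev = alphas[-1]
            a = [emit[j][sym] * sum(prev[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
        c = sum(a)  # P(symbol t | symbols before t, model)
        scales.append(c)
        alphas.append([x / c for x in a])
    return alphas, scales, sum(math.log(c) for c in scales)


def model_posteriors(seq, models, priors=None):
    """P(model | seq) via Bayes' rule; assumes a uniform prior unless given."""
    names = list(models)
    priors = priors or {m: 1.0 / len(names) for m in names}
    logs = {m: math.log(priors[m]) + forward(seq, models[m])[2] for m in names}
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {m: math.exp(v - top) for m, v in logs.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def smooth(seq, model):
    """Forward-backward smoothing: P(state at position t | whole sequence)."""
    trans, emit = model["trans"], model["emit"]
    n = len(model["init"])
    alphas, scales, _ = forward(seq, model)
    beta = [1.0] * n  # scaled backward variable at the last position
    gammas = [None] * len(seq)
    for t in range(len(seq) - 1, -1, -1):
        gammas[t] = [alphas[t][i] * beta[i] for i in range(n)]
        if t > 0:
            # Step the backward variable from position t to position t - 1.
            beta = [sum(trans[i][j] * emit[j][seq[t]] * beta[j]
                        for j in range(n)) / scales[t]
                    for i in range(n)]
    return gammas


# Usage on a made-up sequence in the reduced alphabet.
seq = "hhphhpggphhhpg"
post = model_posteriors(seq, MODELS)
best = max(post, key=post.get)  # most probable structural class
for residue, row in zip(seq, smooth(seq, MODELS[best])):
    print(residue, dict(zip(MODELS[best]["states"], (round(p, 3) for p in row))))
```

Each row printed at the end is a probability distribution over the secondary structural states of the selected model, and it depends on the whole sequence, not only on the residues preceding that position. This mirrors advantage (1) in the abstract. Advantage (2) corresponds to the transition matrix: a zero entry, such as helix to strand in the toy `alpha_beta` model, rules out state paths the model considers unrealizable.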