Mamitsuka H, Abe N
Theory NEC Laboratory, RWCP, Kawasaki, Japan.
Proc Int Conf Intell Syst Mol Biol. 1994;2:276-84.
We describe a method for predicting protein secondary structure, beta-sheet regions in particular, and demonstrate its effectiveness. The method uses a class of stochastic tree grammars as the representational language for amino acid sequence patterns. The family we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the few families of stochastic grammars that are expressive enough to capture the long-distance dependencies exhibited by the sequences of beta-sheet regions while still allowing relatively efficient processing. We applied our method to real data obtained from the HSSP database, and the results are encouraging: using an SRNRG trained on data from a particular protein, our method was able to predict the location and structure of beta-sheet regions in a number of different proteins whose sequences are less than 25 percent homologous to the training sequences. The learning algorithm is an extension of the 'Inside-Outside' algorithm for stochastic context-free grammars, with a number of significant modifications. First, we restricted the grammars to the 'linear' subclass of SRNRG and devised simpler and faster algorithms for this subclass. Second, we reduced the alphabet size (i.e. the number of amino acids) by clustering the amino acids according to their physicochemical properties, gradually over the iterations of the learning algorithm. Finally, we parallelized our parsing algorithm to run on a highly parallel computer, a 32-processor CM-5, and obtained a nearly linear speed-up. (ABSTRACT TRUNCATED AT 250 WORDS)
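For readers unfamiliar with the 'Inside-Outside' algorithm that the learning method extends, the following is a minimal sketch of its inside pass for an ordinary stochastic context-free grammar in Chomsky normal form. It is not the authors' SRNRG algorithm: the function name `inside_probability`, the dictionary-based grammar encoding, and the toy grammar in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


def inside_probability(sequence, lexical, binary, start="S"):
    """Return P(sequence | grammar) for an SCFG in Chomsky normal form.

    lexical: dict mapping (A, terminal) -> probability of rule A -> terminal
    binary:  dict mapping (A, B, C)     -> probability of rule A -> B C
    (This encoding is a hypothetical choice for the sketch.)
    """
    n = len(sequence)
    if n == 0:
        return 0.0

    # beta[i][j][A] = probability that nonterminal A derives sequence[i..j]
    beta = [[defaultdict(float) for _ in range(n)] for _ in range(n)]

    # Base case: spans of length 1 use the lexical rules.
    for i, symbol in enumerate(sequence):
        for (a, terminal), p in lexical.items():
            if terminal == symbol:
                beta[i][i][a] += p

    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                left, right = beta[i][k], beta[k + 1][j]
                if not left or not right:
                    continue
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        beta[i][j][a] += p * left[b] * right[c]

    return beta[0][n - 1].get(start, 0.0)
```

With the toy grammar S -> S S (probability 0.4) and S -> 'a' (probability 0.6), calling `inside_probability("aa", {("S", "a"): 0.6}, {("S", "S", "S"): 0.4})` sums over the single parse and gives approximately 0.144. The full Inside-Outside procedure would add an outside pass and use both tables to re-estimate the rule probabilities.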