MTi, Inserm UMR-S 973, Université Paris Diderot - Paris 7, Paris, F-75205 Cedex 13, France.
BMC Bioinformatics. 2010 Feb 4;11:75. doi: 10.1186/1471-2105-11-75.
Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein function, e.g. in binding sites and catalytic pockets. However, describing protein loops with conventional tools is difficult. Regular secondary structures (helices and strands) have been widely studied, whereas loops are difficult to analyze because they are highly variable in sequence and structure. Owing to data sparsity, long loops have rarely been studied systematically.
We developed a simple and accurate method for describing and analyzing the structures of short and long loops using structural motifs, without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus reduced to handling structural-letter strings, as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments from a bank of 93000 protein loops and grouped them according to their structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and they cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequence specificity, suggesting structural or functional constraints.
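The word-extraction and coverage steps described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that loops are already encoded as structural-letter strings, and that consecutive four-residue letters overlap by three residues, so that a seven-residue fragment corresponds to a word of four letters. The function names, the toy strings, and the toy recurrence threshold are hypothetical, and coverage is measured here over letter positions rather than residues.

```python
from collections import Counter

# Assumption: four overlapping four-residue letters span seven residues.
WORD_LEN = 4


def extract_words(loop, word_len=WORD_LEN):
    """Return every overlapping structural word in one structural-letter string."""
    return [loop[i:i + word_len] for i in range(len(loop) - word_len + 1)]


def count_words(loops, word_len=WORD_LEN):
    """Count occurrences of each structural word over a collection of loops."""
    counts = Counter()
    for loop in loops:
        counts.update(extract_words(loop, word_len))
    return counts


def recurrent_words(counts, min_count):
    """Keep the words observed more than `min_count` times."""
    return {word for word, n in counts.items() if n > min_count}


def coverage(loops, words, word_len=WORD_LEN):
    """Fraction of letter positions lying inside at least one occurrence of a word in `words`."""
    covered = 0
    total = 0
    for loop in loops:
        mask = [False] * len(loop)
        for i in range(len(loop) - word_len + 1):
            if loop[i:i + word_len] in words:
                for j in range(i, i + word_len):
                    mask[j] = True
        covered += sum(mask)
        total += len(loop)
    return covered / total if total else 0.0


# Toy data: arbitrary letter strings, not real HMM-SA encodings.
toy_loops = ["ABCDE", "ABCDF", "XABCD"]
toy_counts = count_words(toy_loops)
# The study uses a threshold of 30 occurrences; 2 is used here only because the toy set is tiny.
toy_recurrent = recurrent_words(toy_counts, min_count=2)
toy_coverage = coverage(toy_loops, toy_recurrent)
```

Because every loop is reduced to fixed-length words, loops of any length contribute to the same counts, which is what makes short and long loops directly comparable in this kind of analysis.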
We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA rather than on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has been used to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and may reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.