Kruus Erik, Thumfort Peter, Tang Chao, Wingreen Ned S
NEC Laboratories America, Inc. 4 Independence Way, Princeton, NJ 08544, USA.
Nucleic Acids Res. 2005 Sep 20;33(16):5343-53. doi: 10.1093/nar/gki842. Print 2005.
Protein backbones have characteristic secondary structures, including alpha-helices and beta-sheets. Which structure is adopted locally is strongly biased by the local amino acid sequence of the protein. Accurate (probabilistic) mappings from sequence to structure are valuable for both secondary-structure prediction and protein design. For the case of alpha-helix caps, we test whether the information content of the sequence-structure mapping can be self-consistently improved by using a relaxed definition of the structure. We derive helix-cap sequence motifs using database helix assignments for proteins of known structure. These motifs are refined using Gibbs sampling in competition with a null motif. Then Gibbs sampling is repeated, allowing for frameshifts of ±1 amino acid residue, in order to find sequence motifs of higher total information content. All helix-cap motifs were found to have good generalization capability, as judged by training on a small set of non-redundant proteins and testing on a larger set. For overall prediction purposes, frameshift motifs using all training examples yielded the best results. Frameshift motifs using a fraction of all training examples performed best in terms of true positives among top predictions. However, motifs without frameshifts also performed well, despite a roughly one-third lower total information content.
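The "total information content" of a sequence motif mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the common definition, the relative entropy of each motif position against a background amino acid distribution, summed over positions and measured in bits. The function names, the pseudocount, and the uniform background in the test case are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(aligned, pseudocount=1.0):
    """Per-position amino acid frequencies for equal-length aligned segments.

    `pseudocount` is a hypothetical smoothing parameter added to every count.
    """
    width = len(aligned[0])
    if any(len(seg) != width for seg in aligned):
        raise ValueError("all segments must have the same length")
    if any(res not in AMINO_ACIDS for seg in aligned for res in seg):
        raise ValueError("segments must use the 20 standard one-letter codes")
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for i in range(width):
        counts = Counter(seg[i] for seg in aligned)
        freqs.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return freqs


def total_information(freqs, background):
    """Sum over positions of the relative entropy (bits) against `background`."""
    return sum(
        p[a] * math.log2(p[a] / background[a])
        for p in freqs
        for a in AMINO_ACIDS
        if p[a] > 0.0  # a zero-frequency residue contributes nothing
    )
```

Under these assumptions, a fully conserved position scores log2(20) ≈ 4.32 bits against a uniform background, and a position that matches the background scores 0 bits. A frameshift of ±1 residue changes which residues land in each motif column, so it can raise or lower this total.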