Department of Biochemistry and Molecular Genetics, University of Virginia, Jordan Hall Box 800733, Charlottesville, VA 22908, USA.
Bioinformatics. 2010 Feb 1;26(3):310-8. doi: 10.1093/bioinformatics/btp660. Epub 2009 Nov 30.
To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models.
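As a rough illustration of this kind of comparison, the sketch below counts k-residue words in a set of sequences and divides each observed count by the count expected if residues were drawn independently from the overall amino acid composition. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the composition-only model is just one plausible reading of the simplest random model, and plain overlapping occurrences are counted rather than the independent word clumps used in the study.

```python
from collections import Counter
from math import prod


def count_words(seqs, k):
    """Count every overlapping k-residue word across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def residue_freqs(seqs):
    """Overall amino acid composition of the sequence set."""
    comp = Counter("".join(seqs))
    total = sum(comp.values())
    return {aa: n / total for aa, n in comp.items()}


def fold_over_composition(seqs, k):
    """Observed / expected ratio for each word seen in seqs.

    Expected count (assumption): number of word positions times the
    product of the individual residue frequencies in the word.
    """
    observed = count_words(seqs, k)
    freqs = residue_freqs(seqs)
    n_positions = sum(max(len(s) - k + 1, 0) for s in seqs)
    return {
        w: obs / (n_positions * prod(freqs[a] for a in w))
        for w, obs in observed.items()
    }
```

A word with a ratio of 2 or more would count as 2-fold overrepresented relative to this composition-only model.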
While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in alpha-helical secondary structures (but not beta-strands). Five-residue consensus exceptional words are enriched for alpha-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for alpha-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
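A window-shuffled model can be sketched as follows. The sketch assumes that "window-shuffled" means permuting residues within consecutive, non-overlapping windows, which preserves local composition while destroying residue order. The window size of 10 and the function name are hypothetical choices for illustration, not values taken from the paper.

```python
import random


def window_shuffle(seq, window=10, rng=None):
    """Shuffle residues within consecutive non-overlapping windows.

    Assumption: fixed-size windows, with a shorter final window if the
    sequence length is not a multiple of `window`.
    """
    rng = rng or random.Random()
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # in-place permutation of this window only
        out.append("".join(chunk))
    return "".join(out)
```

Word counts from many such shuffled copies give a local-composition baseline against which counts from the real sequences can be compared.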
Supplementary data are available at Bioinformatics online.