The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan.
PLoS One. 2012;7(11):e50039. doi: 10.1371/journal.pone.0050039. Epub 2012 Nov 21.
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information relates to structure and function remains enigmatic. In this study, we show that at least part of the sequence information can be extracted by treating the amino acid sequences of proteins as collections of English words, based on the working hypothesis that these sequences are composed of short constituent amino acid sequences (SCSs), or "words". We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. Compared with natural English and with "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often corresponds directly to a sequence motif. The amino acid composition of high-availability sites within motifs differs from that of entire motifs and of all protein sequences, suggesting the possible functional importance of specific SCSs and their constituent amino acids within motifs. We anticipate that our availability-based word decoding approach will complement sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
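To make the rank-frequency analysis described in the abstract concrete, the following is a minimal sketch rather than the authors' procedure. It assumes that SCSs can be approximated by overlapping fixed-length windows (here `k = 3`), which the abstract does not specify, and it estimates the exponent by an ordinary least-squares fit on log-log axes, a crude stand-in for whatever statistical test the study used. The function names `count_words`, `rank_frequency`, and `fit_loglog_slope` and the toy sequences are hypothetical.

```python
from collections import Counter
import math


def count_words(sequences, k=3):
    """Count overlapping length-k windows ("words") across all sequences.

    Assumption: fixed-length sliding windows stand in for SCSs.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def fit_loglog_slope(points):
    """Least-squares line through (log10 rank, log10 frequency).

    Under Zipf's law the slope is close to -1; a shallower slope
    corresponds to a smaller exponent.
    """
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(freq) for _, freq in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Toy sequences, invented for illustration only (not real proteins).
    toy_sequences = ["MKTAYIAKQRMKTAYG", "GSHMKTAYIAKQ", "AKQRQISFVKSHFSRQ"]
    points = rank_frequency(count_words(toy_sequences, k=3))
    slope, intercept = fit_loglog_slope(points)
    print(f"distinct words: {len(points)}, fitted slope: {slope:.2f}")
```

On a real proteome, the fit would presumably be restricted to the approximately linear part of the plot, since the abstract notes that the tails deviate from the power-law trend.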