Strait B J, Dewey T G
Department of Chemistry, University of Denver, Colorado 80208, USA.
Biophys J. 1996 Jul;71(1):148-55. doi: 10.1016/S0006-3495(96)79210-X.
A comprehensive database is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman gambler" is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As is the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much smaller than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.
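The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard block-entropy construction, in which the conditional entropy of a residue given the preceding k-1 residues is estimated as H(k) - H(k-1), and the standard estimate 2^(N·H) for the number of most probable sequences of length N at entropy H bits/residue. The function names and the toy sequence are hypothetical, and estimates from short sequences are strongly biased, so a real analysis needs a large database.

```python
import math
from collections import Counter


def block_entropy(seq: str, k: int) -> float:
    """Shannon entropy (bits) of the empirical distribution of overlapping k-tuplets."""
    if k == 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(seq: str, k: int) -> float:
    """Estimate of the entropy of one residue given the preceding k-1 residues.

    Uses the difference of block entropies, H(k) - H(k-1). For k = 1 this is
    simply the composition entropy. The estimate is unreliable when the
    sequence is too short to sample k-tuplets well.
    """
    return block_entropy(seq, k) - block_entropy(seq, k - 1)


def log10_typical_sequences(entropy_bits: float, length: int) -> float:
    """log10 of 2**(length * entropy_bits), the count of most probable sequences."""
    return length * entropy_bits * math.log10(2)


if __name__ == "__main__":
    # Hypothetical toy sequence, far too short for a meaningful estimate.
    toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
    print("composition entropy (bits/residue):", block_entropy(toy, 1))
    print("k=2 conditional entropy (bits/residue):", conditional_entropy(toy, 2))
    # Compare the entropies quoted in the abstract for a 100-residue protein.
    for h in (math.log2(20), 4.18, 2.5):
        print(f"H = {h:.2f} bits -> about 10^{log10_typical_sequences(h, 100):.1f} sequences")
```

With log2(20) as the entropy, the last function recovers the count of all possible sequences, 20^N. The lower entropies quoted in the abstract shrink that count by many orders of magnitude, which is the comparison the abstract draws.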