
The Shannon information entropy of protein sequences.

Authors

Strait B J, Dewey T G

Affiliation

Department of Chemistry, University of Denver, Colorado 80208, USA.

Publication

Biophys J. 1996 Jul;71(1):148-55. doi: 10.1016/S0006-3495(96)79210-X.

DOI: 10.1016/S0006-3495(96)79210-X
PMID: 8804598
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1233466/

Abstract

A comprehensive data base is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman" gambler is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As in the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.
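
The "letter" analysis described in the abstract can be illustrated with a small sketch. This is not the authors' code. It is a minimal plug-in estimate of the conditional entropy of a residue given the preceding k-1 residues, computed from k-tuplet counts, followed by the standard estimate of about 2^(H*L) most probable sequences of length L. The function names, the toy sequences, and the chain length L = 100 are assumptions made for illustration only.

```python
import math
from collections import Counter


def conditional_entropy(seqs, k):
    """Plug-in estimate (bits/residue) of the entropy of a residue
    given the k-1 residues before it, from overlapping k-tuplet counts."""
    tuples = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(tuples.values())
    # Count each (k-1)-residue prefix using the same k-tuplets, so that
    # the conditional probabilities c / prefixes[...] sum to 1 per prefix.
    prefixes = Counter()
    for word, count in tuples.items():
        prefixes[word[:-1]] += count
    return -sum(
        count / total * math.log2(count / prefixes[word[:-1]])
        for word, count in tuples.items()
    )


def log10_typical_sequences(entropy_bits, length):
    """log10 of 2**(H * L), the approximate number of most probable sequences."""
    return entropy_bits * length * math.log10(2)


# Hypothetical toy data: a strictly alternating two-letter "protein".
toy = ["ACACACACAC", "ACACACAC"]
h1 = conditional_entropy(toy, 1)  # composition only
h2 = conditional_entropy(toy, 2)  # one residue of context

# Upper bound for a 20-letter alphabet with uniform composition.
h_uniform = math.log2(20)

# Assumed chain length L = 100 with the abstract's entropy of 2.5 bits.
log10_probable = log10_typical_sequences(2.5, 100)
log10_possible = 100 * math.log10(20)
```

For the toy data, composition alone gives 1 bit per residue, while one residue of context removes all uncertainty, which is the same effect (in exaggerated form) as the drop from 4.18 to about 2.5 bits reported in the abstract. With the assumed L = 100, an entropy of 2.5 bits/amino acid corresponds to roughly 10^75 most probable sequences, against roughly 10^130 possible sequences. Plug-in estimates of this kind are biased downward when k-tuplets are sparsely sampled, so large k requires a large database.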

Similar articles

1
The Shannon information entropy of protein sequences.
Biophys J. 1996 Jul;71(1):148-55. doi: 10.1016/S0006-3495(96)79210-X.
2
Improved Chou-Fasman method for protein secondary structure prediction.
BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S14. doi: 10.1186/1471-2105-7-S4-S14.
3
Non-equilibrium thermodynamics of molecular evolution.
J Theor Biol. 1998 Aug 21;193(4):593-9. doi: 10.1006/jtbi.1998.0724.
4
New Estimations for Shannon and Zipf-Mandelbrot Entropies.
Entropy (Basel). 2018 Aug 16;20(8):608. doi: 10.3390/e20080608.
5
Information content of protein sequences.
J Theor Biol. 2000 Oct 7;206(3):379-86. doi: 10.1006/jtbi.2000.2138.
6
Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'.
BMC Bioinformatics. 2008 Sep 23;9:394. doi: 10.1186/1471-2105-9-394.
7
Molecular Information Theory Meets Protein Folding.
J Phys Chem B. 2022 Nov 3;126(43):8655-8668. doi: 10.1021/acs.jpcb.2c04532. Epub 2022 Oct 25.
8
An information-theoretic approach to the prediction of protein structural class.
J Comput Chem. 2010 Apr 30;31(6):1201-6. doi: 10.1002/jcc.21406.
9
Entropy, semantic relatedness and proximity.
Behav Res Methods. 2011 Sep;43(3):746-60. doi: 10.3758/s13428-011-0087-7.
10
Estimating the entropy of DNA sequences.
J Theor Biol. 1997 Oct 7;188(3):369-77. doi: 10.1006/jtbi.1997.0493.

Cited by

1
Conserved multiepitopes in STEVORs enable rational design of a fusion antigen vaccine construct with broad immunogenicity.
Emerg Microbes Infect. 2025 Dec;14(1):2552783. doi: 10.1080/22221751.2025.2552783. Epub 2025 Sep 10.
2
FusionEncoder: identification of intrinsically disordered regions based on multi-feature fusion.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf362.
3
PhosF3C: a feature fusion architecture with fine-tuned protein language model and conformer for prediction of general phosphorylation site.
Brief Bioinform. 2025 May 3;26(3). doi: 10.1093/bib/bbaf242.
4
A predictive language model for SARS-CoV-2 evolution.
Signal Transduct Target Ther. 2024 Dec 23;9(1):353. doi: 10.1038/s41392-024-02066-x.
5
Heart patient health monitoring system using invasive and non-invasive measurement.
Sci Rep. 2024 Apr 26;14(1):9614. doi: 10.1038/s41598-024-60500-0.
6
High-throughput Selection of Human de novo-emerged sORFs with High Folding Potential.
Genome Biol Evol. 2024 Apr 2;16(4). doi: 10.1093/gbe/evae069.
7
Substitution Models of Protein Evolution with Selection on Enzymatic Activity.
Mol Biol Evol. 2024 Feb 1;41(2). doi: 10.1093/molbev/msae026.
8
DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model.
BMC Biol. 2024 Jan 2;22(1):3. doi: 10.1186/s12915-023-01803-y.
9
Analysis of a Novel Peptide That Is Capable of Inhibiting the Enzymatic Activity of the Protein Kinase A Catalytic Subunit-Like Protein from Trypanosoma equiperdum.
Protein J. 2023 Dec;42(6):709-727. doi: 10.1007/s10930-023-10153-1. Epub 2023 Sep 15.
10
Recent changes in the mutational dynamics of the SARS-CoV-2 main protease substantiate the danger of emerging resistance to antiviral drugs.
Front Med (Lausanne). 2022 Dec 14;9:1061142. doi: 10.3389/fmed.2022.1061142. eCollection 2022.
