A Markov analysis of DNA sequences.

Author information

Almagor H

Publication information

J Theor Biol. 1983 Oct 21;104(4):633-45. doi: 10.1016/0022-5193(83)90251-5.

DOI:10.1016/0022-5193(83)90251-5
PMID:6316035
Abstract

We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
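
The "correlation question" in the abstract can be illustrated with a minimal sketch. It assumes the standard first-order Markov estimator, in which the expected count of a triplet xyz is N(xy) · N(yz) / N(y), and compares those expectations with the observed triplet counts using a plain chi-square-style statistic. This is not the paper's procedure: it makes no correction for finite sequence length, and the function names (`kmer_counts`, `expected_triplets`, `chi_square`) are hypothetical, chosen only for this example.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers (k=1 bases, k=2 doublets, k=3 triplets)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_triplets(seq):
    """Expected triplet counts if seq were a first-order Markov chain.

    Assumption: the textbook estimator N(xyz) ~ N(xy) * N(yz) / N(y),
    with no finite-length correction.
    """
    singles = kmer_counts(seq, 1)
    doublets = kmer_counts(seq, 2)
    expected = {}
    for x, y, z in product("ACGT", repeat=3):
        if singles[y]:
            expected[x + y + z] = doublets[x + y] * doublets[y + z] / singles[y]
        else:
            expected[x + y + z] = 0.0
    return expected


def chi_square(seq):
    """Sum of (observed - expected)^2 / expected over triplets with expected > 0."""
    observed = kmer_counts(seq, 3)
    expected = expected_triplets(seq)
    return sum((observed[t] - e) ** 2 / e for t, e in expected.items() if e > 0)


if __name__ == "__main__":
    # Toy periodic sequence: its triplets are fully determined by its doublets.
    print(chi_square("ACGT" * 50))
```

On a real sequence, a large value of this statistic hints that the triplet frequencies carry information beyond the doublet frequencies. As the abstract stresses, though, statistical noise in a finite sequence is strong, so the raw value should not be interpreted without accounting for sequence length.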

Similar articles

1
A Markov analysis of DNA sequences.
J Theor Biol. 1983 Oct 21;104(4):633-45. doi: 10.1016/0022-5193(83)90251-5.
2
Doublet frequencies in evolutionary distinct groups.
Nucleic Acids Res. 1984 Feb 10;12(3):1749-63. doi: 10.1093/nar/12.3.1749.
3
Strong doublet preferences in nucleotide sequences and DNA geometry.
J Mol Evol. 1984;20(2):111-9. doi: 10.1007/BF02257371.
4
Recognition sites of eukaryotic DNA topoisomerase I: DNA nucleotide sequencing analysis of topo I cleavage sites on SV40 DNA.
Nucleic Acids Res. 1982 Apr 24;10(8):2565-76. doi: 10.1093/nar/10.8.2565.
5
Deviations from expected frequencies of CpG dinucleotides in herpesvirus DNAs may be diagnostic of differences in the states of their latent genomes.
J Gen Virol. 1989 Apr;70 ( Pt 4):837-55. doi: 10.1099/0022-1317-70-4-837.
6
Bacterial genomes lacking long-range correlations may not be modeled by low-order Markov chains: the role of mixing statistics and frame shift of neighboring genes.
Comput Biol Chem. 2014 Dec;53 Pt A:15-25. doi: 10.1016/j.compbiolchem.2014.08.005. Epub 2014 Aug 30.
7
Nucleotide sequence of the Hind-C fragment of simian virus 40 DNA. Comparison of the 5'-untranslated region of wild-type virus and of some deletion mutants.
Eur J Biochem. 1979 Oct;100(1):51-60. doi: 10.1111/j.1432-1033.1979.tb02032.x.
8
The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis.
Nucleic Acids Res. 1987 Mar 25;15(6):2627-38. doi: 10.1093/nar/15.6.2627.
9
Nucleotide sequence of the restriction fragment Hind-F-EcoRI1 of simian-virus-40 DNA (part of the VP1 gene).
Eur J Biochem. 1978 May 16;86(2):317-24. doi: 10.1111/j.1432-1033.1978.tb12313.x.
10
The early region of SV40 DNA may have more than one gene.
Cell. 1977 Aug;11(4):837-43. doi: 10.1016/0092-8674(77)90295-1.

Cited by

1
Detection and evaluation of clusters within sequential data.
Data Min Knowl Discov. 2025;39(6):69. doi: 10.1007/s10618-025-01140-4. Epub 2025 Aug 14.
2
MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences.
J Mol Biol. 2022 Aug 15;434(15):167586. doi: 10.1016/j.jmb.2022.167586. Epub 2022 Apr 12.
3
Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data.
Quant Biol. 2020 Jul 13;8(2):143-154. doi: 10.1007/s40484-020-0200-y. Epub 2020 May 25.
4
KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate.
Bioinformatics. 2021 May 5;37(6):759-766. doi: 10.1093/bioinformatics/btaa912.
5
Alignment-Free Sequence Analysis and Applications.
Annu Rev Biomed Data Sci. 2018 Jul;1:93-114. doi: 10.1146/annurev-biodatasci-080917-013431. Epub 2018 Apr 25.
6
Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.
Bioinformatics. 2016 Apr 1;32(7):993-1000. doi: 10.1093/bioinformatics/btv395. Epub 2015 Jun 30.
7
A Markovian analysis of bacterial genome sequence constraints.
PeerJ. 2013 Aug 29;1:e127. doi: 10.7717/peerj.127. eCollection 2013.
8
Uniform Accuracy of the Maximum Likelihood Estimates for Probabilistic Models of Biological Sequences.
Methodol Comput Appl Probab. 2011 Mar 1;13(1):105-120. doi: 10.1007/s11009-009-9125-7.
9
Genome signature analysis of thermal virus metagenomes reveals Archaea and thermophilic signatures.
BMC Genomics. 2008 Sep 17;9:420. doi: 10.1186/1471-2164-9-420.
10
Diversity of the abundant pKLC102/PAGI-2 family of genomic islands in Pseudomonas aeruginosa.
J Bacteriol. 2007 Mar;189(6):2443-59. doi: 10.1128/JB.01688-06. Epub 2006 Dec 28.