On the representability of complete genomes by multiple competing finite-context (Markov) models.

Affiliation

Signal Processing Lab, IEETA/DETI, University of Aveiro, Aveiro, Portugal.

Publication information

PLoS One. 2011;6(6):e21588. doi: 10.1371/journal.pone.0021588. Epub 2011 Jun 30.

DOI:10.1371/journal.pone.0021588
PMID:21738720
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3128062/
Abstract

A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which explore the extensive data repetitions in DNA sequences and therefore have a non-local character.
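
The per-position measure and the bits-per-base average described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the additive smoothing estimator with parameter `alpha`, the adaptive (one-pass) counting, and the function names `information_profile`, `bits_per_base`, and `pointwise_best` are all assumptions made for illustration, and inverted-repeat handling is omitted. The per-position minimum over several orders is only an idealized bound, since it ignores the cost of signalling which model was chosen.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet


def information_profile(seq, k, alpha=1.0):
    """Return -log2 P(symbol | previous k symbols) for every position.

    Counts are updated after each symbol is scored, so every estimate
    uses only the past. `alpha` is an assumed additive smoothing
    parameter, not a value taken from the paper.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]        # shorter context near the start
        c = counts[ctx]
        total = sum(c.values())
        p = (c[sym] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))     # information content in bits
        c[sym] += 1
    return profile


def bits_per_base(profile):
    """Average of the information profile over the whole sequence."""
    return sum(profile) / len(profile)


def pointwise_best(profiles):
    """Idealized competition: the smallest cost at each position.

    Ignores the side information needed to tell a decoder which
    model won, so it is a lower bound, not an achievable code length.
    """
    return [min(values) for values in zip(*profiles)]


if __name__ == "__main__":
    seq = "ACGT" * 50                      # toy, highly repetitive input
    p0 = information_profile(seq, k=0)
    p2 = information_profile(seq, k=2)
    print("order 0:", round(bits_per_base(p0), 3), "bits/base")
    print("order 2:", round(bits_per_base(p2), 3), "bits/base")
    print("best   :", round(bits_per_base(pointwise_best([p0, p2])), 3))
```

On a periodic toy sequence such as the one above, the order-2 model should need far fewer bits per base than the order-0 model, because each length-2 context is always followed by the same symbol.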

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/3128062/ea30e2ef2aba/pone.0021588.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c879/3128062/7e5d2f538f2e/pone.0021588.g002.jpg
