A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses.

Affiliations

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China.

Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA.

Publication information

J Comput Biol. 2022 Aug;29(8):839-856. doi: 10.1089/cmb.2021.0604. Epub 2022 Apr 22.

DOI:10.1089/cmb.2021.0604

PMID:35451885

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9419963/

Abstract

The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, the variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key tree-pruning step of their context algorithm, a word count-based statistic is defined for each branch and compared with a fixed cutoff threshold, calculated from a common chi-square distribution, to decide whether to prune that branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the context tree inferred by the original context algorithm of Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error-prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). VLMC-C is founded on consistent branch-specific mixed chi-square distributions derived from the asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B in both context tree reconstruction accuracy and model compression capacity.
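
The fixed-cutoff pruning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a DNA alphabet and a log-likelihood-ratio form of the word count-based statistic (twice the count-weighted Kullback-Leibler divergence between the next-symbol distributions of a child context and its parent), which is one common way the context algorithm is presented. The function names `next_symbol_counts`, `branch_statistic`, and `should_prune` are hypothetical, and the cutoff 7.815 is the 0.95 quantile of a chi-square distribution with 3 degrees of freedom (alphabet size minus one), chosen here only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"      # assumption: DNA sequences
CHI2_95_DF3 = 7.815    # 0.95 quantile of chi-square with 3 d.f. (illustrative cutoff)


def next_symbol_counts(seq, context):
    """Count each symbol that immediately follows an occurrence of `context`."""
    k = len(context)
    counts = Counter()
    for i in range(k, len(seq)):
        if seq[i - k:i] == context:
            counts[seq[i]] += 1
    return counts


def branch_statistic(seq, parent, child):
    """Assumed branch statistic: 2 * sum_x N(child, x) * log(p(x|child) / p(x|parent)).

    `child` extends `parent` by one symbol further into the past, so in
    left-to-right sequence order `child` ends with `parent`.
    """
    if not (child.endswith(parent) and len(child) == len(parent) + 1):
        raise ValueError("child must extend parent by exactly one symbol")
    n_child = next_symbol_counts(seq, child)
    n_parent = next_symbol_counts(seq, parent)
    total_child = sum(n_child.values())
    total_parent = sum(n_parent.values())
    stat = 0.0
    for x in ALPHABET:
        if n_child[x] == 0:
            continue  # zero counts contribute nothing to the sum
        p_child = n_child[x] / total_child
        p_parent = n_parent[x] / total_parent  # positive whenever n_child[x] > 0
        stat += 2.0 * n_child[x] * math.log(p_child / p_parent)
    return stat


def should_prune(stat, cutoff=CHI2_95_DF3):
    """Fixed-cutoff rule: prune the branch when the statistic is below the cutoff."""
    return stat < cutoff


if __name__ == "__main__":
    for repeats in (5, 6):
        seq = "CAGTAT" * repeats
        stat = branch_statistic(seq, parent="A", child="CA")
        print(repeats, round(stat, 3), "prune" if should_prune(stat) else "keep")
```

In this sketch every branch is tested against the same cutoff regardless of which words it involves. That is the design choice the abstract questions: because the word counts entering the statistic are correlated, VLMC-C replaces the single chi-square cutoff with a branch-specific mixed chi-square distribution, which this example does not attempt to reproduce.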

Similar articles

Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains.

Sci Rep. 2016 Nov 23;6:37243. doi: 10.1038/srep37243.

Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Bioinformatics. 2016 Apr 1;32(7):993-1000. doi: 10.1093/bioinformatics/btv395. Epub 2015 Jun 30.

A generalization of the PST algorithm: modeling the sparse nature of protein sequences.

Bioinformatics. 2006 Jun 1;22(11):1302-7. doi: 10.1093/bioinformatics/btl088. Epub 2006 Mar 9.

Bayesian coestimation of phylogeny and sequence alignment.

BMC Bioinformatics. 2005 Apr 1;6:83. doi: 10.1186/1471-2105-6-83.

Optimal choice of word length when comparing two Markov sequences using a χ²-statistic.

BMC Genomics. 2017 Oct 3;18(Suppl 6):732. doi: 10.1186/s12864-017-4020-z.

Modelling heterotachy in phylogenetic inference by reversible-jump Markov chain Monte Carlo.

Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3955-64. doi: 10.1098/rstb.2008.0178.

Efficient Bayesian Species Tree Inference under the Multispecies Coalescent.

Syst Biol. 2017 Sep 1;66(5):823-842. doi: 10.1093/sysbio/syw119.

Class of multiple sequence alignment algorithm affects genomic analysis.

Mol Biol Evol. 2013 Mar;30(3):642-53. doi: 10.1093/molbev/mss256. Epub 2012 Nov 9.

Tail paradox, partial identifiability, and influential priors in Bayesian branch length inference.

Mol Biol Evol. 2012 Jan;29(1):325-35. doi: 10.1093/molbev/msr210. Epub 2011 Sep 2.
