Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China.
Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA.
J Comput Biol. 2022 Aug;29(8):839-856. doi: 10.1089/cmb.2021.0604. Epub 2022 Apr 22.
The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key procedure of tree pruning of their proposed context algorithm, the word count-based statistic for each branch was defined and compared with a fixed cutoff threshold calculated from a common chi-square distribution to prune the branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the inferred context tree based on the original context algorithm by Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). The VLMC-C is founded on the consistent branch-specific mixed chi-square distributions calculated based on asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B for both context tree reconstruction accuracy and model compression capacity.
高阶马尔可夫链(MC)的统计推断对于分子序列分析至关重要,但由于自由参数的高维性,可能会受到阻碍。在 Bühlmann 和 Wyner 的开创性文章中,提出了可变长度马尔可夫链(VLMC)模型,将全阶 MC 嵌入稀疏结构上下文树中。在他们提出的上下文算法的树修剪关键过程中,为每个分支定义了基于单词计数的统计量,并将其与从常见卡方分布计算得出的固定截止阈值进行比较,以修剪上下文树的分支。在这项研究中,我们发现每个分支的单词计数高度相关,对感兴趣的统计量的分布有不可忽略的影响。我们证明了基于 Bühlmann 和 Wyner 原始上下文算法推断的上下文树,该算法使用基于常见卡方分布的固定截止阈值,可能会出现系统偏差和错误。我们将原始上下文算法表示为 VLMC-Biased(VLMC-B)。为了解决这个问题,我们提出了一种新的上下文树推断算法,使用自适应树修剪方案,称为 VLMC-Consistent(VLMC-C)。VLMC-C 基于基于多个单词模式的渐近正态分布计算的一致分支特定混合卡方分布。我们使用模拟数据验证了我们的理论分支特定渐近分布。我们将 VLMC-C 与 VLMC-B 进行比较,使用模拟和真实基因组序列数据进行上下文树推断,并证明 VLMC-C 在上下文树重建准确性和模型压缩能力方面均优于 VLMC-B。