Approximation of sojourn-times via maximal couplings: motif frequency distributions.

Author information

Lladser Manuel E, Chestnut Stephen R

Affiliation

Department of Applied Mathematics, University of Colorado, Boulder, CO, 80309-0526, USA.

Publication information

J Math Biol. 2014 Jul;69(1):147-82. doi: 10.1007/s00285-013-0690-6. Epub 2013 Jun 6.

DOI:10.1007/s00285-013-0690-6

PMID:23739838

Abstract

Sojourn-times provide a versatile framework to assess the statistical significance of motifs in genome-wide searches, even under non-Markovian background models. However, the large state spaces encountered in genomic sequence analyses make the exact calculation of sojourn-time distributions computationally intractable for long sequences. Here, we use coupling and analytic combinatorics techniques to approximate these distributions in the general setting of Polish state spaces, which encompass discrete state spaces. Our approximations are accompanied by explicit, easy-to-compute error bounds in total variation distance. Broadly speaking, if $T_n$ is the random number of times a Markov chain visits a certain subset $T$ of states in its first $n$ transitions, then we can usually approximate the distribution of $T_n$ for $n$ of order $(1 - \alpha)^{-m}$, where $m$ is the largest integer for which the exact distribution of $T_m$ is accessible and $0 \leq \alpha \leq 1$ is an ergodicity coefficient associated with the probability transition kernel of the chain. This gives access to approximations of sojourn-times in the intermediate regime where $n$ is perhaps too large for exact calculations, but too small to rely on Normal approximations or on the stationarity assumptions underlying Poisson and compound Poisson approximations. As a proof of concept, we approximate the distribution of the number of matches with a motif in promoter regions of C. elegans.
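To make the quantity $T_n$ concrete, the following is a minimal sketch of the exact calculation that the abstract describes as feasible only up to some horizon $m$. It is not the authors' method: it is a plain dynamic program over pairs (current state, visit count) for a small finite-state chain. The function name `sojourn_time_distribution`, its parameters, and the convention that only the states reached after each of the first `n` transitions are counted (the initial state is not) are assumptions made for illustration.

```python
import numpy as np

def sojourn_time_distribution(P, init, target, n):
    """Exact pmf of T_n for a finite-state Markov chain (illustrative sketch).

    P      : (k, k) row-stochastic transition matrix
    init   : length-k initial distribution
    target : iterable of state indices forming the subset T
    n      : number of transitions
    Assumed convention: T_n counts how many of X_1, ..., X_n lie in T.
    Returns an array pmf with pmf[c] = P(T_n = c) for c = 0, ..., n.
    """
    P = np.asarray(P, dtype=float)
    init = np.asarray(init, dtype=float)
    k = P.shape[0]
    in_target = np.zeros(k, dtype=bool)
    in_target[list(target)] = True

    # f[s, c] = P(X_t = s and T_t = c); at t = 0 no visits are counted yet.
    f = np.zeros((k, n + 1))
    f[:, 0] = init
    for _ in range(n):
        # step[s2, c] = sum over s of P[s, s2] * f[s, c]
        step = P.T @ f
        g = np.zeros_like(f)
        # Landing outside T leaves the count unchanged.
        g[~in_target, :] = step[~in_target, :]
        # Landing inside T increments the count by one.
        g[in_target, 1:] = step[in_target, :-1]
        f = g
    return f.sum(axis=0)

# Example: a fair coin-flip chain started in state 0, with T = {1}.
pmf = sojourn_time_distribution([[0.5, 0.5], [0.5, 0.5]], [1.0, 0.0], {1}, 3)
```

With a dense transition matrix on k states, this sketch costs on the order of k² n² arithmetic operations, which illustrates why exact calculation becomes impractical for the large state spaces and long sequences mentioned in the abstract.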