Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.
LIRMM, University of Montpellier, Montpellier, France.
J Comput Biol. 2024 Oct;31(10):946-964. doi: 10.1089/cmb.2024.0667. Epub 2024 Oct 9.
An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.
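The final step described in the abstract, turning an expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `normal_approx_p_value` and the `tail` parameter are hypothetical, and the mean and variance of the test statistic under the null model are assumed to have been computed already by some other means.

```python
import math


def normal_approx_p_value(observed, mean, variance, tail="upper"):
    """Approximate a p-value by treating the test statistic as normal.

    observed : value of the test statistic on the real annotations
    mean     : expectation of the statistic under the null model (assumed given)
    variance : variance of the statistic under the null model (assumed given)
    tail     : "upper" tests for enrichment, "lower" tests for depletion
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    # Standardize the observed statistic.
    z = (observed - mean) / math.sqrt(variance)
    if tail == "upper":
        # P(Z >= z) for a standard normal Z
        p = 0.5 * math.erfc(z / math.sqrt(2))
    elif tail == "lower":
        # P(Z <= z) for a standard normal Z
        p = 0.5 * math.erfc(-z / math.sqrt(2))
    else:
        raise ValueError("tail must be 'upper' or 'lower'")
    return z, p


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    z, p = normal_approx_p_value(observed=130.0, mean=100.0, variance=225.0)
    print(f"z = {z:.2f}, upper-tail p = {p:.4f}")
```

Because only the first two moments are needed, a sketch like this costs constant time once the mean and variance are known; the difficult part, which the paper addresses, is computing those moments exactly and efficiently under the Markov chain null model.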