Fast Context-Aware Analysis of Genome Annotation Colocalization.

Affiliations

Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.

LIRMM, University of Montpellier, Montpellier, France.

Publication information

J Comput Biol. 2024 Oct;31(10):946-964. doi: 10.1089/cmb.2024.0667. Epub 2024 Oct 9.

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.
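The final step described in the abstract, turning an expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `normal_approx_p_values` and its parameters are hypothetical, and the mean and variance are assumed to have been computed already under the chosen null model, which is the part the paper's algorithm actually addresses.

```python
import math


def normal_approx_p_values(observed, mean, variance):
    """Illustrative only: one-sided p-values from a normal approximation.

    observed -- observed value of the test statistic (e.g. an overlap count)
    mean     -- expectation of the statistic under the null model (assumed given)
    variance -- variance of the statistic under the null model (assumed given)

    Returns (z, p_enrichment, p_depletion).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    # Upper tail of the standard normal: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    p_enrichment = 0.5 * math.erfc(z / math.sqrt(2))
    # Lower tail: P(Z <= z) = 0.5 * erfc(-z / sqrt(2)).
    p_depletion = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_enrichment, p_depletion
```

A statistic two standard deviations above its null expectation, for example `normal_approx_p_values(120.0, 100.0, 100.0)`, gives a z-score of 2 and an enrichment p-value of roughly 0.023. Using `math.erfc` rather than `1 - cdf` avoids loss of precision in the far tails, where small p-values are of most interest.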

Similar articles

1
Fast Context-Aware Analysis of Genome Annotation Colocalization.
J Comput Biol. 2024 Oct;31(10):946-964. doi: 10.1089/cmb.2024.0667. Epub 2024 Oct 9.
2
Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts.
Res Comput Mol Biol. 2024 Apr-May;14758:38-53. doi: 10.1007/978-1-0716-3989-4_3. Epub 2024 May 17.
