Fast Context-Aware Analysis of Genome Annotation Colocalization.

Affiliations

Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.

LIRMM, University of Montpellier, Montpellier, France.

Publication information

J Comput Biol. 2024 Oct;31(10):946-964. doi: 10.1089/cmb.2024.0667. Epub 2024 Oct 9.

DOI:10.1089/cmb.2024.0667

PMID:39381845

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11698669/

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.
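The final step described in the abstract, turning an expectation and a variance into a p-value through a normal approximation, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `overlap_bases` and `normal_p_values` are hypothetical, the choice of "number of bases covered by both annotations" as the test statistic is an assumption (the abstract does not name its two statistics), and the mean and variance are taken as inputs rather than derived from the Markov chain null model, which the abstract does not specify in enough detail to reproduce.

```python
import math


def overlap_bases(a, b):
    """Count bases covered by both annotations.

    a, b: sorted lists of non-overlapping half-open intervals (start, end).
    Assumed statistic for illustration only.
    """
    i = j = total = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def normal_p_values(observed, mean, variance):
    """Return (enrichment_p, depletion_p) under a normal approximation.

    mean and variance describe the test statistic under the null model;
    here they are supplied by the caller (assumption).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    enrichment = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    depletion = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    return enrichment, depletion


if __name__ == "__main__":
    reference = [(0, 10), (20, 30)]
    query = [(5, 25)]
    k = overlap_bases(reference, query)
    # The null mean and variance below are made-up numbers for the demo.
    print(k, normal_p_values(k, mean=6.0, variance=4.0))
```

A small enrichment p-value would suggest more overlap than the null model expects, and a small depletion p-value would suggest less. Both intervals lists use half-open coordinates, as in BED files.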

Similar articles

Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts.

Res Comput Mol Biol. 2024 Apr-May;14758:38-53. doi: 10.1007/978-1-0716-3989-4_3. Epub 2024 May 17.

Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts.

bioRxiv. 2024 May 20:2023.11.22.568259. doi: 10.1101/2023.11.22.568259.

References cited in this article

The UCSC Genome Browser database: 2023 update.

Nucleic Acids Res. 2023 Jan 6;51(D1):D1188-D1195. doi: 10.1093/nar/gkac1072.

Markov chains improve the significance computation of overlapping genome annotations.

Bioinformatics. 2022 Jun 24;38(Suppl 1):i203-i211. doi: 10.1093/bioinformatics/btac255.

The complete sequence of a human genome.

Science. 2022 Apr;376(6588):44-53. doi: 10.1126/science.abj6987. Epub 2022 Mar 31.

Epigenetic patterns in a complete human genome.

Science. 2022 Apr;376(6588):eabj5089. doi: 10.1126/science.abj5089. Epub 2022 Apr 1.

Computing the Statistical Significance of Overlap between Genome Annotations with iStat.

Cell Syst. 2019 Jun 26;8(6):523-529.e4. doi: 10.1016/j.cels.2019.05.006. Epub 2019 Jun 12.

Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis.

BMC Bioinformatics. 2018 Dec 14;19(1):481. doi: 10.1186/s12859-018-2438-1.

Colocalization analyses of genomic elements: approaches, recommendations and challenges.

Bioinformatics. 2019 May 1;35(9):1615-1624. doi: 10.1093/bioinformatics/bty835.

GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets.

Bioinformatics. 2016 Aug 1;32(15):2256-63. doi: 10.1093/bioinformatics/btw169. Epub 2016 Apr 1.

LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor.

Bioinformatics. 2016 Feb 15;32(4):587-9. doi: 10.1093/bioinformatics/btv612. Epub 2015 Oct 27.

regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests.

Bioinformatics. 2016 Jan 15;32(2):289-91. doi: 10.1093/bioinformatics/btv562. Epub 2015 Sep 30.
