Markov chains improve the significance computation of overlapping genome annotations.

Affiliations

Department of Computer Science, Comenius University, Bratislava 84248, Slovakia.

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

Publication information

Bioinformatics. 2022 Jun 24;38(Suppl 1):i203-i211. doi: 10.1093/bioinformatics/btac255.

Abstract

MOTIVATION

Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain whether there is an underlying biological connection between them. To distinguish true biological association from overlap by pure chance, a robust measure of significance is required. A common approach is to test whether the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome.
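
The test statistic described above, the number of reference intervals that intersect the query annotation, can be sketched as follows. This is only an illustration, not code from the paper: the function name `count_overlapping`, the half-open `(start, end)` representation, the single-chromosome setting and the assumption that query intervals do not overlap each other are all assumptions made here.

```python
from bisect import bisect_right


def count_overlapping(reference, query):
    """Count reference intervals that intersect at least one query interval.

    Assumptions (not from the paper): intervals are half-open (start, end)
    pairs on one chromosome, and query intervals are pairwise non-overlapping.
    """
    query = sorted(query)
    # With non-overlapping query intervals, sorting by start also sorts by end.
    ends = [end for _, end in query]
    count = 0
    for r_start, r_end in reference:
        # Index of the first query interval that ends strictly after r_start.
        i = bisect_right(ends, r_start)
        # It intersects the reference interval iff it starts before r_end.
        if i < len(query) and query[i][0] < r_end:
            count += 1
    return count


reference = [(0, 10), (20, 30), (40, 50)]
query = [(5, 8), (30, 35), (49, 60)]
observed = count_overlapping(reference, query)
```

In this toy input the interval (20, 30) only touches (30, 35) at its boundary, which does not count as an intersection under the half-open convention.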

RESULTS

We show that finding the P-values under the typically used 'gold' null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool, MCDP, that computes the P-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent of the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy.
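
To make the idea of estimating a P-value by sampling concrete, here is a naive Monte Carlo sketch. It is neither MCDP nor the direct sampling algorithm from the paper, and its null model is an assumption made for illustration: query intervals keep their lengths and are re-placed at random as non-overlapping intervals on a single linear genome. All names (`count_hits`, `random_placement`, `estimate_p_value`) are hypothetical.

```python
import random
from bisect import bisect_right


def count_hits(reference, query):
    """Number of reference intervals hitting a non-overlapping query set."""
    query = sorted(query)
    ends = [end for _, end in query]
    hits = 0
    for r_start, r_end in reference:
        i = bisect_right(ends, r_start)  # first query interval ending after r_start
        if i < len(query) and query[i][0] < r_end:
            hits += 1
    return hits


def random_placement(lengths, genome_length, rng):
    """Place intervals of the given lengths at random without overlap.

    Assumed null model: the order of the intervals is shuffled, and the free
    positions are split into gaps by drawing distinct slots at random.
    """
    lengths = list(lengths)
    rng.shuffle(lengths)
    free = genome_length - sum(lengths)
    if free < 0:
        raise ValueError("query intervals do not fit into the genome")
    slots = sorted(rng.sample(range(free + len(lengths)), len(lengths)))
    placed, used = [], 0
    for k, (slot, length) in enumerate(zip(slots, lengths)):
        start = (slot - k) + used  # free positions to the left + lengths placed so far
        placed.append((start, start + length))
        used += length
    return placed


def estimate_p_value(reference, query, genome_length, n_samples=1000, seed=0):
    """One-sided (enrichment) Monte Carlo estimate with add-one correction."""
    rng = random.Random(seed)
    observed = count_hits(reference, query)
    lengths = [end - start for start, end in query]
    extreme = sum(
        count_hits(reference, random_placement(lengths, genome_length, rng)) >= observed
        for _ in range(n_samples)
    )
    return observed, (extreme + 1) / (n_samples + 1)


reference = [(100, 200), (400, 450), (800, 900)]
query = [(150, 160), (420, 430), (850, 860)]
observed, p_value = estimate_p_value(reference, query, genome_length=10_000)
```

Each sample here costs time that grows with the number of intervals, and resolving very small P-values needs very many samples, which is one reason a closed-form computation such as the one the abstract describes is attractive.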

AVAILABILITY AND IMPLEMENTATION

The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e7a8/9235476/51a8afb62d8a/btac255f1.jpg
