A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

Author information

Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175, USA.

Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024, USA.

Publication information

BMC Bioinformatics. 2018 Mar 1;19(1):74. doi: 10.1186/s12859-018-2077-6.

Abstract

BACKGROUND

The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging the ability of this combination to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example.
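As a rough illustration of what "windowed read-count data" means, the sketch below bins read start positions into fixed-size, non-overlapping windows along a single chromosome. The function name `windowed_counts`, the 0-based coordinates, and the toy positions are assumptions made for this example; they are not taken from the paper or from any particular tool.

```python
def windowed_counts(read_starts, chrom_length, window_size):
    """Count reads per fixed-size tiling window (0-based positions assumed)."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        # Ignore positions that fall outside the chromosome.
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1
    return counts


# Toy example: 7 reads on a 300 bp chromosome, 100 bp windows.
counts = windowed_counts([5, 12, 99, 100, 250, 251, 299],
                         chrom_length=300, window_size=100)
print(counts)  # [3, 1, 3]
```

Each entry of `counts` is one observation for a downstream model, so a whole genome tiled at fine resolution yields a very large number of observations.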

RESULTS

We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB-based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
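The idea of sampling the data and estimating on a smaller scale can be sketched as follows. This is not the authors' RGE or GENSENG code: the abstract does not specify the sampling scheme, so uniform sampling without replacement is an assumption here, as are the single covariate `x` (standing in for a bias term such as centered GC content), the known dispersion `ALPHA`, and all function names. The fit uses standard iteratively reweighted least squares (IRLS) for a log-link NB model on simulated counts.

```python
import math
import random


def sample_nb(mu, alpha, rng):
    """Draw one NB count (mean mu, variance mu + alpha*mu^2) as a gamma-Poisson mixture."""
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)
    # Knuth's Poisson sampler; adequate for the moderate means used here.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def fit_nb_glm(xs, ys, alpha, iters=50, tol=1e-8):
    """IRLS for log(mu) = b0 + b1*x under NB with known dispersion alpha."""
    b0, b1 = math.log(max(sum(ys) / len(ys), 1e-8)), 0.0
    for _ in range(iters):
        sw = swx = swxx = swz = swxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            w = mu / (1.0 + alpha * mu)   # IRLS weight for the log link
            z = eta + (y - mu) / mu       # working response
            sw += w
            swx += w * x
            swxx += w * x * x
            swz += w * z
            swxz += w * x * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


def randomized_fit(xs, ys, alpha, m, rng):
    """Fit on a uniform random subsample of m windows (assumed sampling scheme)."""
    idx = rng.sample(range(len(xs)), m)
    return fit_nb_glm([xs[i] for i in idx], [ys[i] for i in idx], alpha)


# Simulated windows: hypothetical true coefficients and dispersion.
rng = random.Random(0)
TRUE_B0, TRUE_B1, ALPHA = 3.0, 0.5, 0.1
xs = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
ys = [sample_nb(math.exp(TRUE_B0 + TRUE_B1 * x), ALPHA, rng) for x in xs]

full = fit_nb_glm(xs, ys, ALPHA)                 # all 20000 windows
sub = randomized_fit(xs, ys, ALPHA, 2000, rng)   # 10% subsample
print("full data:", full)
print("subsample:", sub)
```

In this toy setting the subsample estimates should land close to the full-data ones at a fraction of the per-iteration cost, which is the trade-off the abstract describes; the consistency and variance guarantees themselves come from the paper's analysis, not from this sketch.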

CONCLUSIONS

Our results suggest that the RGE strategy developed here could be applied to other GLM+NB-based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/16dd/5831535/94df310996be/12859_2018_2077_Fig1_HTML.jpg
