Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175, USA.
Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024, USA.
BMC Bioinformatics. 2018 Mar 1;19(1):74. doi: 10.1186/s12859-018-2077-6.
The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging their ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach by detecting copy number variants (CNVs) in a real example.
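As an illustration of the windowing step described above, the following minimal sketch bins read start positions into fixed-size tiling windows. The function name, the 0-based coordinate convention, and the toy input are assumptions made for this example; they are not taken from GENSENG or any specific pipeline.

```python
from collections import Counter


def windowed_read_counts(read_starts, chrom_length, window_size):
    """Count reads per tiling window (hypothetical helper, 0-based positions)."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    hits = Counter(pos // window_size for pos in read_starts)
    # Counter returns 0 for windows that received no reads.
    return [hits[i] for i in range(n_windows)]


if __name__ == "__main__":
    # Five toy reads on a 300 bp sequence, tiled into 100 bp windows.
    print(windowed_read_counts([0, 5, 99, 100, 250], 300, 100))
```

Each entry of the returned list is one observation in the windowed read-count data that a GLM+NB model would then be fitted to.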
We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB-based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
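The idea of sampling the read-count data and estimating on a smaller scale can be sketched as follows. This is only an illustration under stated assumptions, not the RGE implementation: the abstract does not specify the sampling scheme, so uniform sampling without replacement is assumed; the NB dispersion `phi` is treated as known; the model has a single covariate with a log link; and all function names are hypothetical.

```python
import math
import random


def simulate_windows(n, b0, b1, phi, rng):
    """Simulate n windows with NB counts of mean exp(b0 + b1 * x)."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(-0.5, 0.5)  # stand-in for a centered bias covariate
        mu = math.exp(b0 + b1 * x)
        # NB as a gamma-Poisson mixture: variance = mu + phi * mu^2.
        lam = rng.gammavariate(1.0 / phi, phi * mu)
        # Knuth's Poisson sampler (adequate for the small means used here).
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            k += 1
        xs.append(x)
        ys.append(k)
    return xs, ys


def fit_nb_glm(xs, ys, phi, iters=50, tol=1e-10):
    """IRLS for a log-link NB GLM (intercept + one covariate, fixed phi)."""
    b0, b1 = math.log(max(sum(ys) / len(ys), 1e-8)), 0.0
    for _ in range(iters):
        sw = swx = swxx = swz = swxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            w = mu / (1.0 + phi * mu)  # IRLS weight for NB with log link
            z = eta + (y - mu) / mu    # working response
            sw += w
            swx += w * x
            swxx += w * x * x
            swz += w * z
            swxz += w * x * z
        # Solve the 2x2 weighted least-squares normal equations.
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


def randomized_fit(xs, ys, phi, m, rng):
    """Fit on m uniformly sampled windows (assumed sampling scheme)."""
    idx = rng.sample(range(len(xs)), m)
    return fit_nb_glm([xs[i] for i in idx], [ys[i] for i in idx], phi)


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = simulate_windows(20000, b0=3.0, b1=1.0, phi=0.1, rng=rng)
    print("full data :", fit_nb_glm(xs, ys, 0.1))
    print("subsample :", randomized_fit(xs, ys, 0.1, 2000, rng))
```

On simulated data of this kind, the subsample estimate should land close to the full-data estimate while touching only a tenth of the windows. The trade-off is a larger variance for smaller sample sizes `m`.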
Our results suggest that the RGE strategy developed here could be applied to other GLM+NB-based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.