A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

Author information

Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175, USA.

Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024, USA.

Publication information

BMC Bioinformatics. 2018 Mar 1;19(1):74. doi: 10.1186/s12859-018-2077-6.

DOI:10.1186/s12859-018-2077-6

PMID:29490610

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5831535/

Abstract

BACKGROUND

The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its capacity for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example.
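To make the notion of windowed read-count data concrete, the following is a minimal sketch of counting reads in non-overlapping tiling windows. The function name `windowed_read_counts`, the choice to assign each read by its 0-based start position, and the toy coordinates are all assumptions made for illustration; the abstract does not specify how the windows are built or how reads are assigned to them.

```python
def windowed_read_counts(read_starts, genome_length, window_size):
    """Count reads per non-overlapping tiling window.

    Assumption: each read is assigned to the window containing its
    0-based start position; reads outside the genome are ignored.
    """
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window_size] += 1
    return counts


# Toy example: 8 reads on a 300 bp "genome" with 100 bp windows.
reads = [3, 17, 18, 42, 99, 100, 101, 250]
print(windowed_read_counts(reads, genome_length=300, window_size=100))
# prints [5, 2, 1]
```

In a real analysis the window count scales with genome length divided by window size, which is why whole-genome data at fine resolution can reach millions of windows.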

RESULTS

We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
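The idea of sampling the read-count data and estimating on a smaller problem can be illustrated with a short sketch. This is not the authors' RGE: the abstract does not describe the sampling scheme, so the code below assumes plain uniform subsampling of windows, a single covariate (for example, GC content), a log link, and a known dispersion `phi`. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_nb(rng, mu, phi):
    """Draw one NB count with mean mu and variance mu + phi * mu**2,
    using a gamma-Poisson mixture."""
    lam = rng.gammavariate(1.0 / phi, phi * mu)  # gamma with mean mu
    # Knuth's Poisson sampler; adequate for the small means used here.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def fit_nb_glm(xs, ys, phi, n_iter=25):
    """Fisher scoring for log(mu) = b0 + b1 * x with fixed dispersion phi."""
    b0, b1 = math.log(max(sum(ys) / len(ys), 1e-8)), 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            denom = 1.0 + phi * mu
            r = (y - mu) / denom  # score contribution
            w = mu / denom        # Fisher information weight
            u0 += r
            u1 += r * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


def subsample_fit(xs, ys, phi, m, rng):
    """Fit the same model on m windows drawn uniformly without replacement."""
    idx = rng.sample(range(len(xs)), m)
    return fit_nb_glm([xs[i] for i in idx], [ys[i] for i in idx], phi)


if __name__ == "__main__":
    rng = random.Random(0)
    phi, b0, b1 = 0.1, 2.0, 1.0
    xs = [rng.uniform(0.2, 0.8) for _ in range(20000)]
    ys = [sample_nb(rng, math.exp(b0 + b1 * x), phi) for x in xs]
    print("full data :", fit_nb_glm(xs, ys, phi))
    print("subsample :", subsample_fit(xs, ys, phi, 2000, rng))
```

In this toy setting each scoring iteration on the subsample touches one tenth of the windows, while the coefficient estimates stay close to the full-data fit at the price of a larger variance. That trade-off is what a consistency and variance analysis of such an estimator would quantify.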

CONCLUSIONS

Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/16dd/5831535/94df310996be/12859_2018_2077_Fig1_HTML.jpg
