CNOGpro: detection and quantification of CNVs in prokaryotic whole-genome sequencing data.

Author information

Section for Biostatistics and Epidemiology, Norwegian University of Life Sciences (NMBU), Oslo; Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), Ås; and Norwegian Institute of Public Health, Division of Epidemiology, 0403 Oslo, Norway.

Publication information

Bioinformatics. 2015 Jun 1;31(11):1708-15. doi: 10.1093/bioinformatics/btv070. Epub 2015 Feb 1.

DOI:10.1093/bioinformatics/btv070

PMID:25644268

Abstract

MOTIVATION

The rapid adoption of whole-genome sequencing (WGS) as a tool for mapping and understanding genomes has been accompanied by an equally rapid proliferation of tools and pipelines for the analysis of DNA copy number variation (CNV). Most currently available tools are designed specifically for human genomes, with comparatively little literature devoted to CNVs in prokaryotic organisms. However, prokaryotic WGS data have several idiosyncrasies of their own. This work proposes a step-by-step approach for the detection and quantification of copy number variants, aimed specifically at prokaryotes.

RESULTS

After aligning WGS reads to a reference genome, we count the reads in a sliding window and normalize these counts for bias introduced by differences in GC content. We then investigate the coverage in two fundamentally different ways: (i) by employing a Hidden Markov Model and (ii) by repeated sampling with replacement (bootstrapping) on each individual gene. The latter bypasses the complex problem of breakpoint determination. To demonstrate our method, we apply it to real and simulated WGS data and benchmark it against two popular methods for CNV detection. In some cases, the proposed methodology represents a significant gain in accuracy over other current methods.
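The counting, GC-normalization, and per-gene bootstrapping steps described above can be illustrated with a minimal Python sketch. This is not the CNOGpro implementation (which is written in R) and does not cover the Hidden Markov Model. All function names, the fixed non-overlapping windows, the median-ratio GC correction, and the percentile confidence interval are assumptions chosen for illustration only.

```python
import random
from statistics import mean, median


def window_counts(read_starts, genome_length, window=100):
    """Count reads by start position in fixed, non-overlapping windows
    (an assumed simplification of the sliding window in the abstract)."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def gc_normalize(counts, gc_fractions, n_bins=20):
    """Rescale each window so that every GC bin has the same median count.
    The median-ratio correction is an assumption, not the paper's formula."""
    def bin_of(gc):
        return min(int(gc * n_bins), n_bins - 1)

    overall = median(counts)
    by_bin = {}
    for c, gc in zip(counts, gc_fractions):
        by_bin.setdefault(bin_of(gc), []).append(c)
    bin_median = {b: median(v) for b, v in by_bin.items()}

    normalized = []
    for c, gc in zip(counts, gc_fractions):
        m = bin_median[bin_of(gc)]
        # Leave the count unchanged if the bin median is zero.
        normalized.append(c * overall / m if m > 0 else float(c))
    return normalized


def bootstrap_copy_number(gene_windows, baseline, n_boot=1000, alpha=0.05, seed=0):
    """Estimate a gene's copy number as mean window coverage / baseline,
    with a percentile interval from resampling windows with replacement."""
    rng = random.Random(seed)
    k = len(gene_windows)
    estimates = sorted(
        mean(rng.choices(gene_windows, k=k)) / baseline for _ in range(n_boot)
    )
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return mean(gene_windows) / baseline, lower, upper
```

Because the bootstrap resamples only the windows that fall inside an annotated gene, it yields a copy number estimate for that gene without deciding where a duplicated or deleted segment begins and ends. This is one way to read the abstract's remark that the approach bypasses breakpoint determination.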

AVAILABILITY AND IMPLEMENTATION

CNOGpro is written entirely in the R programming language and is available from the CRAN repository (http://cran.r-project.org) under the GNU General Public License.
