Institut Pasteur - Pole Biomics - 25-28 Rue du Docteur Roux, 75015 Paris, France.
Institut Pasteur - Bioinformatics and Biostatistics Hub - C3BI, USR 3756 IP CNRS - Paris, France.
Gigascience. 2018 Dec 1;7(12):giy110. doi: 10.1093/gigascience/giy110.
In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs). It is essential to take into account atypical regions, trends (e.g., origin of replication), and known or unknown biases that influence coverage. It is also important that reported events come with robust statistics (e.g., z-score) for their detection, as well as a precise location.
We provide a stand-alone application, sequana_coverage, that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data. Each event is reported with its significance and with characteristics such as the length of the region. The algorithm first detrends the data using an efficient running median algorithm. It then estimates the distribution of the normalized genome coverage with a Gaussian mixture model. Finally, a z-score statistic is assigned to each base position and used to separate the central distribution from the ROIs (i.e., under- and overcovered regions). A double-threshold mechanism is used to cluster the genomic ROIs. HTML reports provide a summary with interactive visual representations of the genomic ROIs, along with standard plots and metrics. Genomic variations such as single-nucleotide variants or CNVs can be effectively identified at the same time.
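The four steps described above (running-median detrending, Gaussian mixture fit, per-base z-score, double-threshold clustering) can be illustrated with a minimal, standard-library-only sketch. This is not the sequana_coverage implementation. All function names are hypothetical. Several choices are assumptions made for illustration: the window size (201), the thresholds (4 and 2), the use of exactly two mixture components, normalization by dividing coverage by its running median, and the mirrored padding at the sequence ends.

```python
import bisect
import math
import random
import statistics


def running_median(values, window):
    """Centered sliding median (odd window); ends are padded by mirroring (an assumption)."""
    values = list(values)
    half = window // 2
    assert window % 2 == 1 and len(values) > half
    padded = values[half:0:-1] + values + values[-2:-half - 2:-1]
    win = sorted(padded[:window])          # sorted view of the current window
    out = [win[half]]
    for i in range(window, len(padded)):
        del win[bisect.bisect_left(win, padded[i - window])]  # drop the oldest value
        bisect.insort(win, padded[i])                         # add the newest value
        out.append(win[half])
    return out


def fit_two_gaussians(x, iters=100):
    """EM fit of a two-component 1D Gaussian mixture; returns [(weight, mean, sd), ...]."""
    n = len(x)
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x)
    # Start with one narrow component (robust estimates) and one broad component.
    params = [[0.5, med, max(1.4826 * mad, 1e-6)],
              [0.5, statistics.fmean(x), max(statistics.pstdev(x), 1e-6)]]
    for _ in range(iters):
        resp = []                          # E-step: responsibilities per point
        for v in x:
            p = [w * math.exp(-0.5 * ((v - m) / s) ** 2) / s for w, m, s in params]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        for j in range(2):                 # M-step: update weight, mean, sd
            nj = sum(r[j] for r in resp) or 1e-12
            mu = sum(r[j] * v for r, v in zip(resp, x)) / nj
            var = sum(r[j] * (v - mu) ** 2 for r, v in zip(resp, x)) / nj
            params[j] = [nj / n, mu, max(math.sqrt(var), 1e-6)]
    return [tuple(p) for p in params]


def cluster_rois(z, high=4.0, low=2.0):
    """Keep each run beyond `low` only if it also goes beyond `high` somewhere."""
    rois = []
    for sign in (1, -1):                   # overcovered runs, then undercovered runs
        start = None
        for i, v in enumerate(list(z) + [0.0]):   # trailing 0.0 closes an open run
            s = sign * v
            if s > low and start is None:
                start, peak = i, s
            elif s > low:
                peak = max(peak, s)
            elif start is not None:
                if peak > high:
                    rois.append((start, i, sign * peak))  # (start, end exclusive, extreme z)
                start = None
    return sorted(rois)


def find_rois(coverage, window=201, high=4.0, low=2.0):
    """Detrend, fit the mixture, compute z-scores, and cluster them into ROIs."""
    trend = running_median(coverage, window)
    norm = [c / t if t > 0 else 0.0 for c, t in zip(coverage, trend)]
    _, mu, sd = max(fit_two_gaussians(norm))      # component with the largest weight
    return cluster_rois([(v - mu) / sd for v in norm], high, low)


def simulate_coverage(n=1000, seed=0):
    """Toy coverage: slow trend, a deletion at 300-339 and a duplication at 700-759."""
    rng = random.Random(seed)
    cov = []
    for i in range(n):
        depth = 100 + 0.05 * i
        if 300 <= i < 340:
            depth *= 0.05
        if 700 <= i < 760:
            depth *= 2.0
        cov.append(depth * (1 + rng.gauss(0, 0.05)))
    return cov


if __name__ == "__main__":
    for start, end, z in find_rois(simulate_coverage()):
        print(f"ROI [{start}, {end}) length={end - start} extreme z={z:.1f}")
```

In this sketch the window must be noticeably wider than twice the longest event of interest; otherwise the running median follows the event itself and the normalized coverage no longer shows it. The lower threshold extends a region to its less extreme neighbours, while the higher threshold decides whether the region is reported at all.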