CNVind: an open source cloud-based pipeline for rare CNVs detection in whole exome sequencing data based on the depth of coverage.

Affiliation

Warsaw University of Technology, Institute of Computer Science, Nowowiejska 15/19, 00-665, Warsaw, Poland.

Publication information

BMC Bioinformatics. 2022 Mar 5;23(1):85. doi: 10.1186/s12859-022-04617-x.

DOI:10.1186/s12859-022-04617-x

PMID:35247967

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8897915/

Abstract

BACKGROUND

A typical process for detecting copy number variations (CNVs) from the depth of coverage in whole exome sequencing (WES) data consists of several steps: (I) calculating the depth of coverage in the sequencing regions, (II) quality control, (III) normalizing the depth of coverage, and (IV) calling CNVs. Previous tools performed one normalization per chromosome: all the coverage depths in the sequencing regions of a given chromosome were normalized in a single run.

METHODS

Herein, we present CNVind, a new tool for calling CNVs in which the normalization process is conducted separately for each sequencing region. The total number of normalizations therefore equals the number of sequencing regions in the investigated dataset: when analyzing a dataset composed of n sequencing regions, CNVind performs n independent depth-of-coverage normalizations. Before each normalization, the application selects the k sequencing regions most correlated with the region being normalized, using Pearson's correlation of the depth of coverage as the distance metric. The resulting subgroup of k + 1 sequencing regions is then normalized, and the results of all n independent normalizations are combined. Finally, segmentation and CNV calling are performed on the combined dataset.
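
The per-region procedure described above can be sketched roughly as follows. This is only an illustrative sketch, not the CNVind implementation: the function names (`select_subgroup`, `normalize_target`, `independent_normalization`) are hypothetical, the input is assumed to be a regions-by-samples matrix of positive coverage values with no constant rows, and the normalization step is a deliberately simple stand-in (ratio to the subgroup mean, rescaled by the median) rather than the model used by the authors.

```python
import numpy as np


def select_subgroup(corr, target, k):
    """Return the target region index followed by its k most correlated regions.

    corr is the (n_regions, n_regions) Pearson correlation matrix of coverage.
    """
    scores = corr[target].copy()
    scores[target] = -np.inf                    # never pick the target as its own neighbour
    neighbours = np.argsort(scores)[::-1][:k]   # indices of the k highest correlations
    return np.concatenate(([target], neighbours))


def normalize_target(sub):
    """Stand-in normalization (an assumption, not the published method).

    sub has shape (k + 1, n_samples); row 0 is the target region.
    """
    reference = sub.mean(axis=0)        # per-sample mean over the subgroup
    ratio = sub[0] / reference          # target coverage relative to the subgroup
    return ratio / np.median(ratio)     # centre the typical sample at 1.0


def independent_normalization(coverage, k):
    """Run one normalization per region and combine the n results."""
    corr = np.corrcoef(coverage)        # Pearson correlation between regions (rows)
    normalized = np.empty_like(coverage, dtype=float)
    for target in range(coverage.shape[0]):
        group = select_subgroup(corr, target, k)
        normalized[target] = normalize_target(coverage[group])
    return normalized
```

Each iteration of the loop depends only on the coverage matrix and the target index, which is what makes the n normalizations independent and straightforward to distribute across workers.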

RESULTS AND CONCLUSIONS

We used WES data from the 1000 Genomes Project to evaluate the impact of independent normalization on CNV calling performance and compared the results with two state-of-the-art tools, CODEX and exomeCopy. The results showed that independent normalization significantly improves the specificity of rare CNV detection. For example, for the investigated dataset, we reduced the number of false positive (FP) calls from over 15,000 to around 5,000 while keeping the number of true positive (TP) calls constant at about 150 CNVs. However, independent normalization of each sequencing region is computationally expensive, so our pipeline is customized so that it can be easily run in a cloud computing environment, on a computer cluster, or on a single-CPU server. To our knowledge, the presented application is the first attempt to implement this innovative approach of independently normalizing the depth of coverage in WES data.
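
To put the reported counts in perspective, the short calculation below derives the fraction of calls that are true positives, TP / (TP + FP), before and after the reduction in FP calls. This metric is not reported in the abstract, and the inputs are the rounded figures quoted above ("over 15,000", "around 5,000", "about 150"), so the outputs are rough approximations only.

```python
def precision(tp, fp):
    """Fraction of all calls that are true positives."""
    return tp / (tp + fp)


# Rounded counts taken from the abstract; the exact values are not given there.
before = precision(tp=150, fp=15_000)
after = precision(tp=150, fp=5_000)

print(f"precision before: {before:.4f}")
print(f"precision after:  {after:.4f}")
print(f"improvement:      {after / before:.2f}x")
```

With these rounded inputs, the share of true positives among all calls rises from roughly 1% to roughly 3%.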
