Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina (USC), Discovery 449, 915 Greene St, Columbia, SC 29208, USA.
Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab215.
Copy number variation has been identified as a major source of genomic variation associated with disease susceptibility. With the advent of whole-exome sequencing (WES) technology, large volumes of WES data have been generated, allowing copy number variants (CNVs) to be identified in protein-coding regions, where they have a direct functional interpretation. We have previously shown evidence of a genomic correlation structure in array data and developed a novel chromosomal breakpoint detection algorithm, LDcnv, which significantly improved detection power by integrating that correlation structure into a systematic model. However, it remains unexplored whether genomic correlation exists in WES data and how integrating such correlation structure can improve CNV detection accuracy. In this study, we first explored the correlation structure of WES data using the 1000 Genomes Project data. Both the raw read depth and the median-normalized data showed strong evidence of correlation structure. Motivated by this finding, we proposed a correlation-based method, CORRseq, as a new release of the LDcnv algorithm for profiling WES data. The performance of CORRseq was evaluated in extensive simulation studies and in real data analysis from the 1000 Genomes Project. CORRseq outperformed the existing methods in detecting medium and large CNVs. In conclusion, modeling genomic correlation structure is more advantageous when detecting relatively long CNVs. This study provides insights for the methodological development of CNV detection with next-generation sequencing (NGS) data.
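As a rough illustration of what checking read depth for correlation structure can look like, the sketch below median-normalizes a samples-by-targets depth matrix and averages the Pearson correlation between targets a fixed number of positions apart. It is not the CORRseq or LDcnv procedure. The function names (`simulate_depth`, `median_normalize`, `mean_lag_correlation`), the AR(1) noise model and all parameter values are assumptions made for this example, and the data are simulated rather than taken from the 1000 Genomes Project.

```python
import numpy as np


def simulate_depth(n_samples=200, n_targets=50, rho=0.6, seed=0):
    """Simulate a samples x targets read-depth matrix (hypothetical toy model).

    Noise along the targets follows an AR(1) process with coefficient `rho`,
    so neighbouring targets are correlated. Each sample also gets its own
    overall coverage scale.
    """
    rng = np.random.default_rng(seed)
    noise = np.empty((n_samples, n_targets))
    noise[:, 0] = rng.normal(size=n_samples)
    for j in range(1, n_targets):
        innovation = rng.normal(size=n_samples)
        noise[:, j] = rho * noise[:, j - 1] + np.sqrt(1 - rho**2) * innovation
    sample_scale = rng.uniform(0.5, 2.0, size=(n_samples, 1))
    return 100.0 * sample_scale * np.exp(0.2 * noise)


def median_normalize(depth):
    """Divide each sample (row) by its own median depth across targets."""
    depth = np.asarray(depth, dtype=float)
    return depth / np.median(depth, axis=1, keepdims=True)


def mean_lag_correlation(matrix, lag=1):
    """Average Pearson correlation, across samples, between targets `lag` apart."""
    corr = np.corrcoef(matrix, rowvar=False)  # targets x targets
    return float(np.mean(np.diagonal(corr, offset=lag)))


depth = simulate_depth()
normalized = median_normalize(depth)
for lag in (1, 5):
    print(lag, mean_lag_correlation(depth, lag), mean_lag_correlation(normalized, lag))
```

In this toy setup the per-sample coverage scale inflates the correlation at every lag in the raw matrix, which is one reason to examine normalized data alongside raw read depth. After normalization, the correlation should fall off as the distance between targets grows.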