Zhang Tong, Dong Jinxin, Jiang Hua, Zhao Zuyao, Zhou Mengjiao, Yuan Tianting
School of Computer Science and Technology, Liaocheng University, Liaocheng, China.
College of Clinical Medicine, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China.
Front Bioeng Biotechnol. 2022 Dec 1;10:1000638. doi: 10.3389/fbioe.2022.1000638. eCollection 2022.
Copy number variations (CNVs) significantly influence the diversity of the human genome and the occurrence of many complex diseases. Next-generation sequencing (NGS) technology provides rich data for detecting CNVs, and the read depth (RD)-based approach is widely used. However, low-CN duplications (copy number of 3-4) are difficult to identify with existing methods, especially when the CNV is small. In addition, RD-based approaches yield only approximate breakpoints. We propose a new method, CNV-PCC (detection of CNVs based on a Principal Component Classifier), to identify CNVs in whole-genome sequencing data. CNV-PCC first uses split-read signals to search for potential breakpoints. A two-stage segmentation strategy is then applied to improve the detection of low-CN duplications and small CNVs. Next, an outlier score is calculated for each segment by the Principal Component Classifier (PCC). Finally, the Otsu algorithm computes the threshold that determines the CNV regions. On simulated data, CNV-PCC outperforms the other methods in sensitivity and F1-score and improves breakpoint accuracy. On real sequencing samples, CNV-PCC shows high consistency with other methods. This study demonstrates that CNV-PCC is an effective method for detecting CNVs, even for low-CN duplications and small CNVs.
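The last two steps of the pipeline (per-segment outlier scoring followed by Otsu thresholding) can be illustrated with a minimal sketch. This is not the authors' implementation: the two per-segment features (mean read depth and mean mapping quality), the toy values, the function names, and the choice to keep every principal component are all assumptions made for illustration. The score used here is the sum of squared principal-component projections, each divided by its eigenvalue, which is one common form of a principal-component-based outlier score.

```python
import numpy as np


def pcc_outlier_scores(features):
    """Score each row by its squared PC projections scaled by the eigenvalues."""
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0                      # avoid dividing by zero for constant features
    z = (x - x.mean(axis=0)) / std           # standardize each feature
    cov = np.atleast_2d(np.cov(z, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, so eigh is appropriate
    keep = eigvals > 1e-12                   # drop degenerate components
    proj = z @ eigvecs[:, keep]              # coordinates in principal-component space
    return (proj ** 2 / eigvals[keep]).sum(axis=1)


def otsu_threshold(scores, bins=64):
    """Return the histogram edge that maximizes between-class variance."""
    hist, edges = np.histogram(scores, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    p = hist / hist.sum()
    w0 = np.cumsum(p)                        # weight of the low-score class
    w1 = 1.0 - w0                            # weight of the high-score class
    mu = np.cumsum(p * centers)              # cumulative (unnormalized) mean
    valid = (w0 > 1e-12) & (w1 > 1e-12)      # both classes must be non-empty
    between = np.zeros_like(w0)
    between[valid] = (mu[-1] * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return edges[1:][np.argmax(between)]


# Toy input (hypothetical values): 40 segments near the expected depth of 30,
# followed by 3 segments with roughly doubled depth.
depth = np.concatenate([30 + np.tile([-1.0, 0.0, 1.0, 0.0], 10), [60.0, 61.0, 59.0]])
mapq = np.concatenate([50 + np.tile([0.5, -0.5], 20), [50.0, 50.5, 49.5]])
features = np.column_stack([depth, mapq])

scores = pcc_outlier_scores(features)
threshold = otsu_threshold(scores)
flagged = np.flatnonzero(scores >= threshold)  # indices of candidate CNV segments
print(flagged)
```

Using Otsu's method here avoids a hand-tuned cutoff: the threshold is derived from the score distribution of each sample, which is useful when the separation between normal and variant segments differs between samples.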