Xie Kun, Tian Ye, Yuan Xiguo
The School of Computer Science and Technology, Xidian University, Xi'an, China.
Xi'an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi'an, China.
Front Genet. 2021 Jan 13;11:632311. doi: 10.3389/fgene.2020.632311. eCollection 2020.
Copy number variation (CNV) is a common type of structural variation in the human genome and has biological implications for complex human diseases. Detecting CNVs is an important step in their systematic analysis in medical research on complex diseases. The recent development of next-generation sequencing (NGS) platforms provides unprecedented opportunities to detect CNVs at base-level resolution. However, owing to the intrinsic characteristics of NGS data, accurate detection of CNVs remains a challenging task. In this article, we propose a new density peak-based method, called dpCNV, for detecting CNVs from NGS data. The dpCNV algorithm is based on the density peak clustering algorithm. It extracts two features, local density and minimum distance, from the sequencing read depth (RD) profile and generates a two-dimensional dataset. From this dataset, a two-dimensional null distribution is constructed to test the significance of each genome bin, and the significant bins are declared as CNVs. We test the performance of dpCNV on a number of simulated datasets and compare it with several existing methods. The experimental results demonstrate that the proposed method outperforms the others in terms of sensitivity and F1-score. We further apply it to a set of real sequencing samples, and the results demonstrate the validity of dpCNV. We therefore expect that dpCNV can be used as a supplement to existing methods and may become a routine tool in the field of genome mutation analysis.
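The two features named in the abstract can be illustrated with a minimal sketch. It assumes the standard density peak clustering definitions (local density as the number of neighbours within a cutoff, minimum distance as the distance to the nearest point of higher density) applied to per-bin RD values. The abstract does not give dpCNV's exact formulas, so the function name `density_peak_features`, the `cutoff` parameter, and the toy RD values are all hypothetical and not taken from the paper.

```python
import numpy as np

def density_peak_features(rd, cutoff):
    """Return (local_density, min_distance) for each bin of a 1-D RD profile.

    Assumption: textbook density peak clustering definitions, not
    necessarily the exact ones used by dpCNV.
    """
    rd = np.asarray(rd, dtype=float)
    # Pairwise absolute differences in read depth between bins.
    dist = np.abs(rd[:, None] - rd[None, :])
    # Local density: number of other bins whose RD lies within the cutoff.
    rho = (dist < cutoff).sum(axis=1) - 1
    delta = np.empty(rd.size)
    for i in range(rd.size):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest bin with strictly higher density.
            delta[i] = dist[i, higher].min()
        else:
            # Bins of maximal density get their largest distance by convention.
            delta[i] = dist[i].max()
    return rho, delta

# Toy RD profile (hypothetical values): bin 5 has roughly doubled depth.
rd_profile = [100, 102, 98, 101, 99, 200, 100, 103]
rho, delta = density_peak_features(rd_profile, cutoff=5)
for i, (r, d) in enumerate(zip(rho, delta)):
    print(f"bin {i}: RD={rd_profile[i]}, density={r}, min_distance={d:.0f}")
```

In this toy profile the amplified bin has the lowest local density and a large minimum distance, which is the kind of point a two-dimensional significance test would be expected to single out. The null distribution and the test itself are not sketched here because the abstract does not specify their form.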