Liu Guojun, Yang Hongzhi, Yuan Xiguo
School of Statistics, Xi'an University of Finance and Economics, Xi'an, China.
Medical Imaging Center, Xidian Group Hospital, Xi'an, China.
Front Genet. 2023 Jan 17;13:1084974. doi: 10.3389/fgene.2022.1084974. eCollection 2022.
Copy number variation (CNV) is one of the main types of structural variation in the human genome and accounts for a considerable proportion of genomic variation. Because CNVs can directly or indirectly cause cancer, mental illness, and genetic disease, their effective detection is of great interest in oncogene discovery, clinical decision-making, bioinformatics, and drug discovery. The advent of next-generation sequencing has made CNV detection possible, and a large number of CNV detection tools are based on next-generation sequencing data. Owing to the complexity of next-generation sequencing data (e.g., bias, noise, alignment errors) and of CNV structures, the accuracy of existing CNV detection methods remains low. In this work, we design a new CNV detection approach, called shortest path-based copy number variation (SPCNV), to improve the detection accuracy of CNVs. SPCNV calculates the k nearest neighbors of each read depth and defines the shortest path, shortest path relation, and shortest path cost sets, from which it calculates the mean shortest path cost of each read depth and of each of its k nearest neighbors. We use the ratio between the mean shortest path cost of each read depth and the average of the mean shortest path costs of its k nearest neighbors to construct a relative shortest path score formula that assigns a score to each read depth. A boxplot is then applied to the score profile to predict CNVs. The performance of the proposed method is verified in simulation experiments and compared against several popular methods of the same type. Experimental results show that the proposed method achieves the best balance between recall and precision in each set of simulated samples. To further verify its performance in real application scenarios, we conduct experiments on real sample data from the 1000 Genomes Project. The proposed method achieves the best F1-scores in almost all samples. Therefore, the proposed method can be used as a more reliable tool for the routine detection of CNVs.
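The scoring idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the shortest path, shortest path relation, or shortest path cost sets, so the sketch assumes that the cost between a read depth and one of its k nearest neighbors is simply their absolute difference. It also assumes the standard boxplot upper fence (Q3 + 1.5 × IQR) as the decision rule. All function names, the toy read depths, and k = 2 are hypothetical.

```python
from statistics import mean, quantiles


def knn_indices(values, k):
    """For each value, return the indices of its k nearest other values."""
    neighbors = []
    for i, v in enumerate(values):
        others = sorted(
            (j for j in range(len(values)) if j != i),
            key=lambda j: (abs(values[j] - v), j),  # ties broken by index
        )
        neighbors.append(others[:k])
    return neighbors


def mean_path_costs(values, neighbors):
    """Mean cost from each value to its k nearest neighbors.

    Assumption: the cost is the absolute read-depth difference. The paper's
    shortest path cost sets are not specified in the abstract.
    """
    return [
        mean(abs(values[j] - values[i]) for j in nbrs)
        for i, nbrs in enumerate(neighbors)
    ]


def relative_scores(values, k, eps=1e-9):
    """Ratio of each value's mean cost to the average mean cost of its neighbors."""
    neighbors = knn_indices(values, k)
    cost = mean_path_costs(values, neighbors)
    scores = []
    for i, nbrs in enumerate(neighbors):
        neighbor_mean = mean(cost[j] for j in nbrs)
        # eps only guards against division by zero when costs are all zero.
        scores.append((cost[i] + eps) / (neighbor_mean + eps))
    return scores


def boxplot_outliers(scores, whisker=1.5):
    """Indices whose score exceeds the upper boxplot fence Q3 + whisker * IQR."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    upper = q3 + whisker * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > upper]


# Toy read depths per genomic bin (made up); bin 5 mimics a copy number gain.
read_depths = [30.0, 28.7, 31.3, 29.1, 30.9, 60.0, 28.0, 31.9, 29.8, 32.4, 30.4]
scores = relative_scores(read_depths, k=2)
flagged = boxplot_outliers(scores)
print(flagged)  # prints [5]
```

In this toy profile, only the bin with read depth 60.0 receives a score far above 1 and is flagged. Bins in the normal range score close to 1 because their mean cost is similar to that of their neighbors. The sketch is sensitive to tied read depths: when the neighbors of a bin all have zero cost, the ratio becomes unstable, so real data would need a more careful cost definition than the one assumed here.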