Luo Xizhi, Qin Fei, Cai Guoshuai, Xiao Feifei
Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA.
Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA.
Bioinformatics. 2021 Apr 20;37(3):312-317. doi: 10.1093/bioinformatics/btaa737.
Copy number variation plays an important role in complex human diseases. Copy number variants (CNVs) are detected by identifying mean shifts in genetic intensities to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms rest on the strong assumption that observations at genetic loci are independent, and they assume that each locus has an equal chance of being a breakpoint (i.e. a CNV boundary). However, from a genetics perspective this assumption is violated, because genomic positions are correlated, for example through linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed follows a non-random pattern across the genome. To detect CNVs more accurately, we propose a novel algorithm, LDcnv, that models CNV data together with its biological characteristics relating to the genetic dependence structure (i.e. LD).
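To make the notion of segmentation as mean-shift detection concrete, the following is a minimal sketch of a single-breakpoint scan under the independence assumption criticized above. It is not the LDcnv algorithm; the function name `best_breakpoint`, the known noise SD `sigma` and the toy intensity profile are all assumptions introduced for illustration.

```python
import numpy as np

def best_breakpoint(intensities, sigma=1.0):
    """Return (index, statistic) of the most likely single mean shift.

    Hypothetical helper: assumes independent observations with a common,
    known noise SD `sigma`, i.e. the assumption that correlated loci violate.
    """
    x = np.asarray(intensities, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two loci")
    csum = np.cumsum(x)
    total = csum[-1]
    # Candidate split k: left segment is x[:k], right segment is x[k:]
    k = np.arange(1, n)
    left_mean = csum[:-1] / k
    right_mean = (total - csum[:-1]) / (n - k)
    # Standardized difference of segment means under independence
    stat = np.abs(left_mean - right_mean) / (sigma * np.sqrt(1.0 / k + 1.0 / (n - k)))
    j = int(np.argmax(stat))
    return int(k[j]), float(stat[j])

# Toy, noise-free profile: six loci at baseline, then four loci shifted up
profile = np.array([0.0] * 6 + [1.0] * 4)
idx, stat = best_breakpoint(profile)
```

In this sketch every locus is treated as an equally plausible breakpoint, which is the behavior that the abstract argues is unrealistic when neighboring positions are in LD.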
We theoretically demonstrated the correlation structure of CNV data from SNP arrays, which further supports the need to integrate biological structure into statistical methods for CNV detection. We therefore developed LDcnv, which integrates the genomic correlation structure with a local search strategy into the statistical modeling of CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. LDcnv showed high accuracy, stability and robustness in CNV detection, and higher precision in detecting short CNVs than existing methods. This new segmentation algorithm has a wide scope of potential application to data from various high-throughput technology platforms.
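One way to see why correlation between loci matters for a mean-based test is to compute how much it inflates the variance of a segment mean. The sketch below assumes a stationary AR(1) correlation, rho^|i-j|, between loci i and j. This is an illustrative stand-in, not the correlation model derived in the paper, and the function name `variance_inflation` is hypothetical.

```python
def variance_inflation(n, rho):
    """Ratio of Var(mean of n observations) under AR(1) correlation
    to the independence value sigma^2 / n.

    Uses the closed form 1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho^k.
    The AR(1) structure is an assumption made for illustration only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 + 2.0 * sum((1.0 - k / n) * rho**k for k in range(1, n))

# Inflation factor for a 20-locus segment with moderate positive correlation
factor = variance_inflation(20, 0.5)
```

Under positive correlation the factor exceeds 1, so a threshold calibrated for independent loci understates the noise in a segment mean and can overstate the evidence for a breakpoint.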
https://github.com/FeifeiXiaoUSC/LDcnv.
Supplementary data are available at Bioinformatics online.