Chen Jie, Deng Shirong
1 Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, Georgia.
2 School of Mathematics and Statistics, Wuhan University, Wuhan, China.
J Comput Biol. 2018 Oct;25(10):1128-1140. doi: 10.1089/cmb.2018.0053. Epub 2018 Jul 27.
In this article, we investigate the problem of detecting the boundaries of DNA copy number variation (CNV) regions using DNA-sequencing data from multiple subject samples. Genomic features along the linear realization of the actual genome are correlated, especially in the vicinity of a locus, and so are the sequencing reads along the genome. It is therefore crucial, from both statistical and computational viewpoints, to take the correlated structure of such high-throughput genomic data into account when modeling DNA-sequencing data for CNV detection. We use the framework of a fused Lasso latent feature model to solve the problem, and we propose a modified information criterion for selecting the tuning parameter when searching for common CNVs shared by multiple subjects. Simulation studies and an application to next-generation sequencing data from multiple subjects, downloaded from the 1000 Genomes Project, showed that the proposed approach can effectively identify individual CNVs in a single subject's profile as well as common CNVs shared by multiple subjects.