Malekpour Seyed Amir, Pezeshk Hamid, Sadeghi Mehdi
School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran; School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran.
Math Biosci. 2016 Sep;279:53-62. doi: 10.1016/j.mbs.2016.07.006. Epub 2016 Jul 16.
The association of copy number variation (CNV) with schizophrenia, autism, developmental disabilities, and fatal diseases such as cancer has been established. Recent developments in next-generation sequencing (NGS) have facilitated CNV studies. However, many current CNV detection tools cannot discriminate tandem duplications from non-tandem duplications.
In this study, we propose MGP-HMM, a tool that, besides detecting genome-wide deletions, discriminates tandem duplications from non-tandem duplications. MGP-HMM takes mate pair abnormalities into account and predicts the digitized number of tandem or non-tandem copies. Abnormalities in the orientations and insertion sizes of mate pairs, observed after mapping to the reference genome, are interpreted using a Hidden Markov Model (HMM). For this purpose, a Gaussian mixture density with time-dependent parameters is used to emit mate pair insertion sizes from the HMM states: depending on the observed abnormalities in mate pair insertion size or orientation, each component of the mixture density has different parameters. MGP-HMM also applies a Poisson distribution to model read depth data. This parametric modeling of the mate pair reads makes it possible to estimate the length of CNVs precisely, an advantage over methods that rely only on the read depth approach for CNV detection. The hidden state of the proposed HMM is the digitized copy number of a genomic segment, and the states correspond to the multipliers of the Gaussian mixture components. The accuracy of the model is validated on real and simulated next-generation sequencing data and compared with that of other tools.
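The general idea of combining a Poisson read depth term with a Gaussian mixture over mate pair insertion sizes inside an HMM can be sketched as follows. This is a minimal illustration under stated assumptions, not the MGP-HMM implementation: the window-based layout, the four copy number states, every numeric parameter (`DEPTH_PER_COPY`, `MU`, `SIGMA`, the `ABNORMAL` table, the `stay` probability) and all function names are hypothetical, and mate pair orientation is left out for brevity.

```python
import math

# All values below are hypothetical and chosen only for illustration.
STATES = [0, 1, 2, 3]          # digitized copy number of a genomic window
DEPTH_PER_COPY = 15.0          # assumed mean read depth contributed by one copy
MU, SIGMA = 3000.0, 300.0      # assumed insertion size of concordant mate pairs
# Per state: (weight of the abnormal mixture component, its mean insertion size)
ABNORMAL = {0: (0.5, 5000.0), 1: (0.5, 5000.0), 2: (0.01, 5000.0), 3: (0.5, 1000.0)}


def poisson_logpmf(k, lam):
    """Log probability of observing k reads under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def normal_logpdf(x, mu, sigma):
    """Log density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a short list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def window_loglik(depth, inserts, state):
    """Emission log-likelihood of one window: Poisson depth plus mixture inserts."""
    lam = DEPTH_PER_COPY * max(state, 0.05)  # small floor keeps log(lam) finite
    weight, mu_abnormal = ABNORMAL[state]
    total = poisson_logpmf(depth, lam)
    for size in inserts:
        total += logsumexp([
            math.log(1.0 - weight) + normal_logpdf(size, MU, SIGMA),
            math.log(weight) + normal_logpdf(size, mu_abnormal, SIGMA),
        ])
    return total


def viterbi(observations, stay=0.9):
    """Most likely copy number path for a list of (depth, insert_sizes) windows."""
    n = len(STATES)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]
    first_depth, first_inserts = observations[0]
    score = [math.log(1.0 / n) + window_loglik(first_depth, first_inserts, s)
             for s in STATES]
    backpointers = []
    for depth, inserts in observations[1:]:
        prev, score, pointers = score, [], []
        for j, state in enumerate(STATES):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j]
                         + window_loglik(depth, inserts, state))
        backpointers.append(pointers)
    path = [max(range(n), key=lambda i: score[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [STATES[i] for i in reversed(path)]
```

In this sketch, a window with roughly half the expected depth and mate pairs mapping much farther apart than `MU` is pulled toward copy number 1, because both the Poisson term and the abnormal mixture component favor that state. A window supported only by read depth has no comparable signal about where the variant starts and ends, which is the kind of advantage the abstract attributes to modeling mate pairs.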