Marioni John C, White Michael, Tavaré Simon, Lynch Andrew G
Department of Oncology, Computational Biology Group, University of Cambridge, Cancer Research UK Cambridge Research Institute, Robinson Way, Cambridge, United Kingdom.
Proc Natl Acad Sci U S A. 2008 Jul 22;105(29):10067-72. doi: 10.1073/pnas.0711252105. Epub 2008 Jul 15.
Recently, the extent of copy number variation (CNV) throughout the genome has been shown to be far greater than previously thought. Further, it has been demonstrated that specific copy number variable regions (CNVRs) are associated with particular diseases, suggesting that these genetic variations may have an important biological role. Hence, calling CNVRs and subsequently classifying samples as "losses" or "gains" is of great interest. A number of papers have been published containing classifications of CNVs, and here we show how the presence of pedigree information can be used for assessing the performance of those classification methods. In this article, by examining CNV classifications made in the HapMap samples, we show that estimates of the number of false-positive classifications per individual made by current approaches can be determined. Moreover, commonplace technologies for determining the locations of CNVRs aggregate information across the maternal and paternal chromosomes at the locus of interest. Here, we show that copy number variation on each chromosome can be inferred and, in particular, we discuss the existence of a class of CNVs that are inevitably misclassified and give an estimate of their prevalence. Although our focus is not on the development of calling algorithms per se, we describe and provide an example of how our model might be incorporated into the initial classification procedure to produce more robust results. Finally, we discuss how this methodology might be applied to future studies to obtain better estimates of the extent of CNV across the genome.
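The two ideas in the abstract, aggregation across the maternal and paternal chromosomes and the use of pedigree information to flag suspect calls, can be illustrated with a toy sketch. This is not the authors' model. It assumes a reference state of one copy per chromosome, a simple threshold on the total copy number, and hypothetical helper names (`classify`, `call`, `possible_child_calls`, `unexplained_child_call`) introduced only for illustration.

```python
from itertools import product


def classify(total_copies):
    """Label a total (two-chromosome) copy number; 2 is assumed to be the reference."""
    if total_copies < 2:
        return "loss"
    if total_copies > 2:
        return "gain"
    return "normal"


def call(genotype):
    """Call an individual from per-chromosome copy numbers, e.g. (0, 2).

    Only the sum is visible to a technology that aggregates both chromosomes.
    """
    return classify(sum(genotype))


def possible_child_calls(father, mother):
    """Calls a child could receive by inheriting one chromosome from each parent."""
    return {classify(f + m) for f, m in product(father, mother)}


def unexplained_child_call(father_call, mother_call, child_call):
    """Naive pedigree check: a non-normal child call seen in neither parent.

    Such a call may be a false positive in the child, a missed call in a
    parent, or a de novo event; this check alone cannot tell them apart.
    """
    return child_call != "normal" and child_call not in (father_call, mother_call)


# A parent with a deletion on one chromosome and a duplication on the other
# has a total of two copies and is therefore called "normal".
father = (0, 2)
mother = (1, 1)
print(call(father), call(mother))
print(sorted(possible_child_calls(father, mother)))
```

In this toy case both parents are called "normal", yet every child carries either a loss or a gain. A naive trio check would flag the child's call as unexplained, which shows why inferring copy number on each chromosome matters when pedigree information is used to estimate false-positive rates.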