Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, United Kingdom.
Genet Epidemiol. 2011 Sep;35(6):536-48. doi: 10.1002/gepi.20604. Epub 2011 Jul 18.
Accurate assignment of copy number at known copy number variant (CNV) loci is important both for improving our understanding of the structural evolution of genomes and for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem, but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy-tailed and asymmetric intensity distributions, contain outlying observations, and may exhibit systematic technical differences among cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Because of these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus and the relative copy number of each individual in the sample. This approach is fully automated, which is a critical requirement when analyzing large numbers of CNVs. We illustrate the method's performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and simulated data.
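To make the clustering framing concrete, the following is a minimal sketch of choosing the number of copy-number classes at a single locus and assigning each sample a relative class. It is not CNVCALL: it swaps the Bayesian hierarchical mixture model for a plain one-dimensional Gaussian mixture fitted by EM, with the number of classes chosen by BIC, and it ignores the heavy tails, outliers, and cohort effects described above. The function names, the simulated intensities, and all parameter values are assumptions made for illustration only.

```python
import math
import random


def component_densities(v, weights, means, sds):
    """Weighted Gaussian density of value v under each mixture component."""
    return [w * math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, sds)]


def fit_gmm_1d(x, k, n_iter=100):
    """EM for a k-component 1-D Gaussian mixture (illustrative, not CNVCALL).

    Returns (weights, means, sds, log_likelihood).
    """
    n = len(x)
    lo, hi = min(x), max(x)
    mean_all = sum(x) / n
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in x) / n)
    # Deterministic start: means evenly spaced across the observed range.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sds = [sd_all / k] * k
    weights = [1.0 / k] * k
    min_sd = 0.01 * sd_all  # floor that prevents zero-variance components
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every sample.
        resp = []
        for v in x:
            dens = component_densities(v, weights, means, sds)
            total = sum(dens) + 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and spread of each class.
        for j in range(k):
            nk = sum(r[j] for r in resp)
            weights[j] = nk / n
            if nk < 1e-8:
                continue  # leave an (almost) empty component where it is
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nk
            var = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nk
            sds[j] = max(math.sqrt(var), min_sd)
    loglik = sum(math.log(sum(component_densities(v, weights, means, sds)) + 1e-300)
                 for v in x)
    return weights, means, sds, loglik


def select_num_classes(x, max_k=5):
    """Pick the number of classes with the lowest BIC (3k - 1 free parameters)."""
    best = None
    for k in range(1, max_k + 1):
        weights, means, sds, loglik = fit_gmm_1d(x, k)
        bic = -2.0 * loglik + (3 * k - 1) * math.log(len(x))
        if best is None or bic < best[0]:
            best = (bic, k, weights, means, sds)
    return best[1:]


def call_classes(x, weights, means, sds):
    """Assign each sample to its most probable class, ranked by mean intensity."""
    order = sorted(range(len(means)), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    calls = []
    for v in x:
        dens = component_densities(v, weights, means, sds)
        calls.append(rank[max(range(len(dens)), key=lambda j: dens[j])])
    return calls


# Simulated intensities for one locus with three classes (hypothetical values).
rng = random.Random(0)
true_means, true_sd, sizes = [-0.5, 0.0, 0.4], 0.05, [60, 200, 140]
intensities = [rng.gauss(m, true_sd)
               for m, size in zip(true_means, sizes) for _ in range(size)]

best_k, weights, means, sds = select_num_classes(intensities)
calls = call_classes(intensities, weights, means, sds)
print(best_k, [round(m, 2) for m in sorted(means)])
```

The calls are relative classes (0 is the lowest-intensity cluster), mirroring the abstract's distinction between relative and absolute copy number. A Gaussian likelihood is sensitive to exactly the outliers and asymmetric tails that the abstract identifies as the hard part of the problem, so this sketch should be read as the baseline that a more robust model improves on.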