Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Genet Epidemiol. 2012 Apr;36(3):253-62. doi: 10.1002/gepi.21618.
A major concern for all copy number variation (CNV) detection algorithms is their reliability and repeatability. However, it is difficult to evaluate the reliability of CNV-calling strategies due to the lack of gold-standard data that would tell us which CNVs are real. We propose that if CNVs are called in duplicate samples, or inherited from parent to child, then these can be considered validated CNVs. We used two large family-based genome-wide association study (GWAS) datasets from the GENEVA consortium to look at concordance rates of CNV calls between duplicate samples, parent-child pairs, and unrelated pairs. Our goal was to make recommendations for ways to filter and use CNV calls in GWAS datasets that do not include family data. We used PennCNV as our primary CNV-calling algorithm, and tested CNV calls using different datasets and marker sets, and with various filters on CNVs and samples. Using the Illumina core HumanHap550 single nucleotide polymorphism (SNP) set, we saw duplicate concordance rates of approximately 55% and parent-child transmission rates of approximately 28% in our datasets. GC model adjustment and sample quality filtering had little effect on these reliability measures. Stratification on CNV size and DNA sample type did have some effect. Overall, our results show that it is probably not possible to find a CNV-calling strategy (including filtering and algorithm) that will give us a set of "reliable" CNV calls using current chip technologies. But if we understand the error process, we can still use CNV calls appropriately in genetic association studies.
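The duplicate-sample concordance described above can be illustrated with a small sketch. The abstract does not say how two CNV calls were judged to be the same, so the matching rule here (same chromosome, same deletion/duplication state, and at least 50% reciprocal overlap) is an assumption. The names `CNV`, `reciprocal_overlap`, `concordance_rate`, and `min_overlap` are hypothetical and are not part of PennCNV.

```python
from typing import NamedTuple, Sequence


class CNV(NamedTuple):
    chrom: str
    start: int  # first base pair of the call, inclusive
    end: int    # last base pair of the call, inclusive
    kind: str   # "del" or "dup"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Smaller of the two fractions of each call covered by the shared region."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return min(shared / len_a, shared / len_b)


def concordance_rate(calls_a: Sequence[CNV], calls_b: Sequence[CNV],
                     min_overlap: float = 0.5) -> float:
    """Fraction of calls in calls_a matched by at least one call in calls_b.

    min_overlap is an assumed threshold, not a value taken from the paper.
    """
    if not calls_a:
        return float("nan")
    matched = sum(
        any(reciprocal_overlap(a, b) >= min_overlap for b in calls_b)
        for a in calls_a
    )
    return matched / len(calls_a)


# Made-up calls for one sample and its duplicate (illustration only).
sample = [
    CNV("1", 1000, 2000, "del"),
    CNV("2", 5000, 9000, "dup"),
    CNV("3", 100, 200, "del"),
]
duplicate = [
    CNV("1", 1100, 2100, "del"),
    CNV("2", 5000, 6000, "dup"),
]

rate = concordance_rate(sample, duplicate)
```

With these made-up calls, only the chromosome 1 deletion meets the 50% reciprocal threshold, so one of the three calls is concordant. The same function could be applied to parent-child pairs to estimate a transmission rate, and to unrelated pairs as a baseline, under the same assumed matching rule.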