Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, Washington, United States of America.
PLoS One. 2010 Dec 30;5(12):e14456. doi: 10.1371/journal.pone.0014456.
The detection of copy number variants (CNVs) and the results of CNV-disease association studies rely on how CNVs are defined, and because array-based technologies can only infer CNVs, CNV-calling algorithms can produce vastly different findings. Several authors have noted the large-scale variability between CNV-detection methods, as well as the substantial false positive and false negative rates associated with those methods. In this study, we use variations of four common algorithms for CNV detection (PennCNV, QuantiSNP, HMMSeg, and cnvPartition) and two definitions of overlap (any overlap and an overlap of at least 40% of the smaller CNV) to illustrate the effects of varying algorithms and definitions of overlap on CNV discovery.
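The two overlap definitions named above can be made concrete with a minimal sketch. It assumes CNVs are represented as inclusive base-pair intervals on a named chromosome. The `CNV` type and all function names are hypothetical illustrations, not the code used in the study, and the abstract does not state how the 40% criterion was implemented.

```python
from typing import NamedTuple


class CNV(NamedTuple):
    """Hypothetical CNV call: inclusive base-pair coordinates on one chromosome."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Inclusive coordinates, so a call from 100 to 199 spans 100 bp.
        return self.end - self.start + 1


def overlap_bp(a: CNV, b: CNV) -> int:
    """Number of base pairs shared by two calls (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def any_overlap(a: CNV, b: CNV) -> bool:
    """Definition 1: the calls share at least one base pair."""
    return overlap_bp(a, b) > 0


def overlaps_fraction_of_smaller(a: CNV, b: CNV, threshold: float = 0.40) -> bool:
    """Definition 2: shared bases cover at least `threshold` of the smaller call."""
    smaller = min(a.length, b.length)
    return overlap_bp(a, b) / smaller >= threshold


# Toy coordinates for illustration only (not data from the study).
a = CNV("chr1", 100, 199)   # 100 bp
b = CNV("chr1", 180, 379)   # 200 bp, shares 20 bp with a
c = CNV("chr1", 150, 349)   # 200 bp, shares 50 bp with a

print(any_overlap(a, b), overlaps_fraction_of_smaller(a, b))
print(any_overlap(a, c), overlaps_fraction_of_smaller(a, c))
```

In this toy case, `a` and `b` count as the same CNV under the any-overlap definition but not under the 40% definition, which is one way the choice of definition alone can change how many calls are matched to previously reported CNVs.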
We used a 56 K Illumina genotyping array enriched for CNV regions to generate hybridization intensities and allele frequencies for 48 Caucasian schizophrenia cases and 48 age-, ethnicity-, and gender-matched control subjects. No algorithm found a difference in CNV burden between the two groups. However, the total number of CNVs called ranged from 102 to 3,765 across algorithms. The mean CNV size ranged from 46 kb to 787 kb, and the average number of CNVs per subject ranged from 1 to 39. The number of novel CNVs not previously reported in normal subjects ranged from 0 to 212.
Because multiple genome-wide SNP arrays are publicly available, investigators are conducting numerous analyses to identify putative additional CNVs in complex genetic disorders. However, the number of CNVs identified in array-based studies, and whether these CNVs are novel or valid, will depend on the algorithm(s) used. Given the variety of methods in use, many false positives and false negatives are to be expected. Both guidelines for identifying CNVs inferred from high-density arrays and a gold standard for validating CNVs are needed.