Department of Epidemiology, The University of Texas MD Anderson Cancer Center, 1155 Pressler St, Houston, Texas 77030, USA.
BMC Bioinformatics. 2011 Aug 9;12:331. doi: 10.1186/1471-2105-12-331.
SNP genotyping arrays have been developed to characterize single-nucleotide polymorphisms (SNPs) and DNA copy number variations (CNVs). Nonparametric and model-based statistical algorithms have been developed to detect CNVs from SNP data using the marker intensities. However, these algorithms lack specificity for small CNVs because calling CNVs from intensity values produces a high false positive rate. As a result, the association tests built on these calls lack power even when the CNVs affecting disease risk are common. An alternative procedure, PennCNV, uses information from both the marker intensities and the genotypes and therefore has increased sensitivity.
We developed a new genome-wide algorithm to detect CNV associations with diseases. It uses the hidden Markov model (HMM) implemented in PennCNV to derive the probabilities of the different copy number states and then enters these probabilities into a logistic regression model. We compared this new method with an association test applied to the most probable copy number state for each individual. PennCNV provides that state after an initial HMM analysis followed by the Viterbi algorithm, a step that discards the information about copy number probabilities. In one of our simulation studies, we showed that for large CNVs (number of SNPs ≥ 10), the association tests based on PennCNV calls gave more significant results, but the new algorithm retained high power. For small CNVs (number of SNPs < 10), the logistic algorithm provided smaller average p-values (e.g., p = 7.54e-17 when relative risk RR = 3.0) in all the scenarios and could capture signals that PennCNV did not (e.g., p = 0.020 when RR = 3.0). From a second set of simulations, we showed that the new algorithm is more powerful in detecting disease associations with small CNVs (number of SNPs ranging from 3 to 5) under different penetrance models (e.g., when RR = 3.0, for relatively weak signals, power = 0.8030 compared with 0.2879 obtained from the association tests based on PennCNV calls). The new method was implemented in the software GWCNV, which is freely available at http://gwcnv.sourceforge.net and distributed under a GPL license.
We conclude that the new algorithm is more sensitive and can be more powerful in detecting CNV associations with diseases than the existing HMM algorithm, especially when the CNV association signal is weak and a limited number of SNPs are located in the CNV.
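The contrast described above, between regressing on copy number probabilities and testing hard copy number calls, can be illustrated with a small sketch. This is not the GWCNV implementation. It rests on several assumptions made only for illustration: four copy number states (0 to 3 copies), posterior probabilities that are invented toy values rather than PennCNV output, a per-individual argmax standing in for the Viterbi call, and the expected copy number used as the single covariate. All function names are hypothetical.

```python
import math

def expected_copy_number(posteriors):
    """Collapse posterior probabilities over states 0..K into an expected copy number."""
    return sum(k * p for k, p in enumerate(posteriors))

def hard_call(posteriors):
    """Most probable state; a per-individual stand-in for a Viterbi call (assumption)."""
    return max(range(len(posteriors)), key=lambda k: posteriors[k])

def _sigmoid(eta):
    # Numerically stable logistic function.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def fit_logistic(x, y, n_iter=50, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b0, b1, log-likelihood)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:  # singular information matrix, stop updating
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        loglik += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return b0, b1, loglik

def lrt_pvalue(x, y):
    """Likelihood-ratio test (1 df) of b1 = 0; needs both cases and controls in y."""
    if max(x) == min(x):  # covariate carries no variation, so no signal to test
        return 1.0
    n, n1 = len(y), sum(y)
    null_ll = n1 * math.log(n1 / n) + (n - n1) * math.log((n - n1) / n)
    _, _, ll = fit_logistic(x, y)
    stat = max(0.0, 2.0 * (ll - null_ll))
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function, 1 df

def toy_cohort():
    """Invented posteriors over (0, 1, 2, 3) copies; cases lean toward a deletion."""
    cases = [(0.0, 0.40, 0.60, 0.0), (0.0, 0.30, 0.70, 0.0), (0.0, 0.10, 0.90, 0.0),
             (0.0, 0.05, 0.95, 0.0), (0.0, 0.40, 0.60, 0.0), (0.0, 0.30, 0.70, 0.0)]
    controls = [(0.0, 0.05, 0.95, 0.0), (0.0, 0.10, 0.90, 0.0), (0.0, 0.02, 0.98, 0.0),
                (0.0, 0.30, 0.70, 0.0), (0.0, 0.02, 0.98, 0.0), (0.0, 0.05, 0.95, 0.0)]
    return cases + controls, [1] * len(cases) + [0] * len(controls)

if __name__ == "__main__":
    posteriors, status = toy_cohort()
    soft = [expected_copy_number(p) for p in posteriors]
    hard = [float(hard_call(p)) for p in posteriors]
    print("p-value, probability-based covariate:", lrt_pvalue(soft, status))
    print("p-value, hard calls:", lrt_pvalue(hard, status))
```

In this toy cohort every individual's most probable state is two copies, so the hard calls contain no variation and the test on them cannot detect anything, whereas the probability-based covariate still differs between cases and controls. This mirrors, in a deliberately simplified way, why weak signals from small CNVs can be lost once the copy number probabilities are discarded.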