Istituto di Studi sui Sistemi Intelligenti per l'Automazione - CNR, Bari, Italy.
BMC Genomics. 2011 Mar 30;12:166. doi: 10.1186/1471-2164-12-166.
The typical objective of genome-wide association (GWA) studies is to identify single-nucleotide polymorphisms (SNPs) and corresponding genes with the strongest evidence of association (the 'most-significant SNPs/genes' approach). Borrowing ideas from micro-array data analysis, we propose a new method, named RS-SNP, for detecting sets of genes enriched in SNPs moderately associated with the phenotype. RS-SNP assesses whether the number of significant SNPs, with p-value P ≤ α, belonging to a given SNP set S is statistically significant. The rationale of the proposed method is that two kinds of null hypotheses are taken into account simultaneously. In the first null model, the genotype and the phenotype are assumed to be independent random variables, and the null distribution is the probability that the number of significant SNPs in S is greater than observed by chance. The second null model assumes that the number of significant SNPs in S depends on the size of S and not on the identity of the SNPs in S. Statistical significance is assessed using non-parametric permutation tests.
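The two null models described above can be illustrated with a minimal Python sketch. It is not the RS-SNP implementation: the function names (`trend_pvalues`, `count_significant`, `rs_snp_sketch`), the trend-style per-SNP test, the defaults for `alpha` and `n_perm`, and the choice to report one permutation p-value per null model rather than combining them are all assumptions made for illustration.

```python
import math
import numpy as np


def trend_pvalues(G, y):
    """Per-SNP p-values from a trend-style test (assumed here, not from the paper).

    G is an (individuals x SNPs) genotype matrix coded 0/1/2, y a 0/1 phenotype.
    Under the null, n * r^2 is approximately chi-square with 1 degree of freedom.
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Gc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Monomorphic SNPs have zero variance; give them r = 0 (p-value 1).
    r = np.divide(Gc.T @ yc, denom, out=np.zeros(G.shape[1]), where=denom > 0)
    stat = n * r ** 2
    # Survival function of chi-square(1): P(X > s) = erfc(sqrt(s / 2)).
    return np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])


def count_significant(pvals, snp_set, alpha):
    """Number of SNPs in snp_set with p-value <= alpha."""
    return int(np.sum(pvals[snp_set] <= alpha))


def rs_snp_sketch(G, y, snp_set, alpha=0.05, n_perm=200, seed=0):
    """Return (observed count, p-value under null 1, p-value under null 2)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G)
    y = np.asarray(y)
    snp_set = np.asarray(snp_set)
    pvals = trend_pvalues(G, y)
    observed = count_significant(pvals, snp_set, alpha)

    # Null 1: genotype and phenotype independent -> permute phenotype labels.
    null1 = np.empty(n_perm, dtype=int)
    for b in range(n_perm):
        perm_p = trend_pvalues(G, rng.permutation(y))
        null1[b] = count_significant(perm_p, snp_set, alpha)

    # Null 2: only the size of S matters -> draw random SNP sets of equal size.
    null2 = np.empty(n_perm, dtype=int)
    for b in range(n_perm):
        random_set = rng.choice(G.shape[1], size=len(snp_set), replace=False)
        null2[b] = count_significant(pvals, random_set, alpha)

    # Add-one correction keeps the permutation p-values strictly positive.
    p1 = (1 + np.sum(null1 >= observed)) / (n_perm + 1)
    p2 = (1 + np.sum(null2 >= observed)) / (n_perm + 1)
    return observed, float(p1), float(p2)
```

Calling `rs_snp_sketch(G, y, snp_set)` with the column indices of the SNPs in S returns the observed count of significant SNPs and one p-value per null model. In this sketch a set would be considered enriched only if both p-values are small, which is one possible reading of taking both null hypotheses into account simultaneously.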
We applied RS-SNP to the Crohn's disease (CD) data set collected by the Wellcome Trust Case Control Consortium (WTCCC) and compared the results with GENGEN, an approach recently proposed in the literature. The enrichment analysis using RS-SNP and the set of pathways contained in the MSigDB C2 CP pathway collection highlighted 86 pathways rich in SNPs weakly associated with CD. Of these, 47 were also indicated to be significant by GENGEN. Similar results were obtained using the MSigDB C5 pathway collection. Many of the pathways found to be enriched by RS-SNP have a well-known connection to CD and, often, to inflammatory diseases.
The proposed method is a valuable alternative to other techniques for enrichment analysis of SNP sets. It is well founded from a theoretical and statistical perspective. Moreover, the experimental comparison with GENGEN indicates that it is more robust with respect to false-positive findings.