Hao Ke, Schadt Eric E, Storey John D
Rosetta Inpharmatics, Seattle, Washington, United States of America.
PLoS Genet. 2008 Jun 27;4(6):e1000109. doi: 10.1371/journal.pgen.1000109.
To facilitate whole-genome association studies (WGAS), several high-density SNP genotyping arrays have been developed. Genetic coverage and statistical power are the primary benchmark metrics in evaluating the performance of SNP arrays. Ideally, such evaluations would be done on a SNP set and a cohort of individuals that are both independently sampled from the original SNPs and individuals used in developing the arrays. Without utilization of an independent test set, previous estimates of genetic coverage and statistical power may be subject to an overfitting bias. Additionally, the SNP arrays' statistical power in WGAS has not been systematically assessed on real traits. One robust setting for doing so is to evaluate statistical power on thousands of traits measured from a single set of individuals. In this study, 359 newly sampled Americans of European descent were genotyped using both Affymetrix 500K (Affx500K) and Illumina 650Y (Ilmn650K) SNP arrays. From these data, we were able to obtain estimates of genetic coverage, which are robust to overfitting, by constructing an independent test set from among these genotypes and individuals. Furthermore, we collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Our genetic coverage estimates are lower than previous reports, providing evidence that previous estimates may be inflated due to overfitting. The Ilmn650K platform showed reasonable power (50% or greater) to detect SNPs associated with quantitative traits when the signal-to-noise ratio (SNR) is greater than or equal to 0.5 and the causal SNP's minor allele frequency (MAF) is greater than or equal to 20% (N = 359). 
In testing each of the more than 40,000 gene expression traits for association with each of the SNPs on the Ilmn650K and Affx500K arrays, we found that the Ilmn650K yielded 15% more discoveries than the Affx500K at the same false discovery rate (FDR) level.
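As an illustration of what comparing discovery counts "at the same FDR level" can look like, the sketch below counts rejections for two sets of association p-values under a common FDR threshold. It uses the Benjamini-Hochberg step-up procedure as a stand-in, since the abstract does not state which FDR method the study used. The function name `bh_discoveries` and both p-value lists are hypothetical and made up for the example; they are not data from the study.

```python
def bh_discoveries(p_values, fdr=0.05):
    """Count rejections under the Benjamini-Hochberg step-up procedure.

    Assumption: BH is used here for illustration only; it is not
    necessarily the FDR method applied in the study.
    """
    m = len(p_values)
    ranked = sorted(p_values)
    k = 0
    # Find the largest rank i whose p-value is at or below fdr * i / m.
    for i, p in enumerate(ranked, start=1):
        if p <= fdr * i / m:
            k = i
    # All hypotheses with rank <= k are rejected.
    return k


# Hypothetical p-values for the same traits tested on two arrays.
array_a = [0.001, 0.004, 0.012, 0.030, 0.20, 0.45, 0.60, 0.81]
array_b = [0.001, 0.009, 0.040, 0.15, 0.22, 0.47, 0.66, 0.90]

found_a = bh_discoveries(array_a, fdr=0.05)
found_b = bh_discoveries(array_b, fdr=0.05)
print(found_a, found_b, found_a / found_b)
```

Holding the FDR threshold fixed for both arrays means the expected proportion of false positives among the discoveries is controlled at the same level, so the ratio of discovery counts can be read as a measure of relative power.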