Thomas Duncan C, Casey Graham, Conti David V, Haile Robert W, Lewinger Juan Pablo, Stram Daniel O
Department of Preventive Medicine, University of Southern California.
Stat Sci. 2009 Nov 1;24(4):414-429. doi: 10.1214/09-sts288.
Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies (GWAS), using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Cost savings of about 50% are typically possible with this design while obtaining comparable levels of overall type I error and power, by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage; the optimal design depends primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed odds ratios (ORs) of current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly significant associations has been discovered, a truly independent "exact replication" study of the same promising SNPs is needed in a similar population using similar methods. This can then be followed by (1) "generalizability" studies to assess the full scope of replicated associations across different races, different endpoints, different interactions, etc.; (2) fine-mapping or re-sequencing to try to identify the causal variant; and (3) experimental studies of the biological function of these genes. Multistage sampling designs may be more useful at this follow-up stage, say for selecting subsets of subjects for deep re-sequencing of regions identified in the GWAS.
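The cost figure quoted in the abstract can be illustrated with a rough calculation. The sketch below is not the authors' optimization: it counts genotyping cost only, ignores power and type I error, and assumes a hypothetical stage II per-genotype cost 20 times that of stage I (the abstract gives no value for this ratio). The function name and parameters are illustrative.

```python
def two_stage_cost_fraction(sample_fraction_stage1, marker_fraction_stage2, cost_ratio):
    """Genotyping cost of a two-stage design relative to a one-stage design.

    The one-stage design genotypes every subject on the full panel (cost = 1).
    cost_ratio is the stage II per-genotype cost divided by the stage I
    per-genotype cost (an assumed input, not a value from the abstract).
    """
    # Stage I: a fraction of the subjects is genotyped on the full panel.
    stage1 = sample_fraction_stage1
    # Stage II: the remaining subjects are genotyped only on the SNPs
    # carried forward, at the custom-genotyping price.
    stage2 = (1.0 - sample_fraction_stage1) * marker_fraction_stage2 * cost_ratio
    return stage1 + stage2


# Half the sample in stage I and 0.1% of SNPs carried forward, as in the
# abstract; the cost ratio of 20 is a hypothetical value.
fraction = two_stage_cost_fraction(0.5, 0.001, 20.0)
print(f"relative cost: {fraction:.3f}, savings: {1.0 - fraction:.1%}")
```

Under these assumptions the two-stage design costs about 0.51 of the one-stage design, in line with the roughly 50% savings stated above. Because so few SNPs are carried forward, stage I dominates the total unless the cost ratio is very large.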