Guey Lin T, Kravic Jasmina, Melander Olle, Burtt Noël P, Laramie Jason M, Lyssenko Valeriya, Jonsson Anna, Lindholm Eero, Tuomi Tiinamaija, Isomaa Bo, Nilsson Peter, Almgren Peter, Kathiresan Sekar, Groop Leif, Seymour Albert B, Altshuler David, Voight Benjamin F
Applied Quantitative Genotherapeutics, Pfizer Biotherapeutics, Cambridge, MA 02144, USA.
Genet Epidemiol. 2011 May;35(4):236-46. doi: 10.1002/gepi.20572.
Next-generation sequencing technologies are making it possible to study the role of rare variants in human disease. Many studies balance statistical power with cost-effectiveness by (a) sampling from phenotypic extremes and (b) utilizing a two-stage design. Two-stage designs include a broad-based discovery phase and selection of a subset of potential causal genes/variants to be further examined in independent samples. We evaluate three parameters: first, the gain in statistical power due to extreme sampling to discover causal variants; second, the informativeness of initial (Phase I) association statistics to select genes/variants for follow-up; third, the impact of extreme and random sampling in (Phase II) replication. We present a quantitative method to select individuals from the phenotypic extremes of a binary trait, and simulate disease association studies under a variety of sample sizes and sampling schemes. First, we find that while studies sampling from extremes have excellent power to discover rare variants, they have limited power to associate them with phenotype, suggesting high false-negative rates for upcoming studies. Second, consistent with previous studies, we find that the effect sizes estimated in these studies are expected to be systematically larger than the overall population effect size; in a well-cited lipids study, we estimate the reported effect to be twofold larger. Third, replication studies require large samples from the general population to have sufficient power; extreme sampling could reduce the required sample size by as much as fourfold. Our observations offer practical guidance for the design and interpretation of studies that utilize extreme sampling.
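The contrast between extreme and random sampling described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the authors' simulation framework: it assumes a standard-normal quantitative trait (the paper itself works with extremes of a binary trait), a single rare variant whose carriers are shifted by a fixed number of standard deviations, and hypothetical names and parameter values (`simulate_population`, `extreme_sample`, `random_sample`, `compare_designs`, a carrier frequency of 2%, an effect of 1 SD).

```python
import random


def simulate_population(n, carrier_freq, effect_sd, rng):
    """Return a list of (phenotype, is_carrier) pairs.

    Assumption: phenotype is standard normal, shifted by effect_sd in carriers.
    """
    population = []
    for _ in range(n):
        carrier = rng.random() < carrier_freq
        phenotype = rng.gauss(0.0, 1.0) + (effect_sd if carrier else 0.0)
        population.append((phenotype, carrier))
    return population


def extreme_sample(population, n_per_tail):
    """Take the n_per_tail highest and n_per_tail lowest phenotypes."""
    ranked = sorted(population, key=lambda person: person[0])
    return ranked[-n_per_tail:], ranked[:n_per_tail]


def random_sample(population, n_per_group, rng):
    """Draw 2 * n_per_group people at random and split them at their median."""
    chosen = rng.sample(population, 2 * n_per_group)
    median = sorted(p for p, _ in chosen)[n_per_group]
    high = [person for person in chosen if person[0] >= median]
    low = [person for person in chosen if person[0] < median]
    return high, low


def carrier_counts(high_group, low_group):
    """Count carriers in the high-phenotype and low-phenotype groups."""
    return (sum(c for _, c in high_group), sum(c for _, c in low_group))


def compare_designs(seed=1, n_population=20000, n_per_group=500,
                    carrier_freq=0.02, effect_sd=1.0):
    """Carrier counts (high, low) under each design, at equal sequencing cost."""
    rng = random.Random(seed)
    population = simulate_population(n_population, carrier_freq, effect_sd, rng)
    return {
        "extreme": carrier_counts(*extreme_sample(population, n_per_group)),
        "random": carrier_counts(*random_sample(population, n_per_group, rng)),
    }


if __name__ == "__main__":
    for design, (high, low) in compare_designs().items():
        print(f"{design}: {high} carriers in high group, {low} in low group")
```

Both designs sequence the same number of individuals, but under these assumptions the tails concentrate carriers in one group, so the carrier-count contrast is larger than in a random sample. The same enrichment is why an effect size estimated from the extremes would overstate the effect in the general population.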