Panarella Michela, Burkett Kelly M
Department of Biology, University of Ottawa, Ottawa, ON, Canada.
Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada.
Front Genet. 2019 May 3;10:398. doi: 10.3389/fgene.2019.00398. eCollection 2019.
Extreme phenotype sampling (EPS) is a popular study design used to reduce genotyping or sequencing costs. Assuming continuous phenotype data are available on a large cohort, EPS involves genotyping or sequencing only those individuals with extreme phenotypic values. Although this design has been shown to have high power to detect genetic effects even at smaller sample sizes, little attention has been paid to the effects of confounding variables, and in particular population stratification. Using extensive simulations, we demonstrate that the false positive rate under the EPS design is greatly inflated relative to a random sample of equal size or a "case-control"-like design in which the cases come from one phenotypic extreme and the controls are randomly sampled. The inflated false positive rate is observed even with allele frequency and phenotype mean differences taken from European population data. We show that the effects of confounding are not reduced by increasing the sample size. We also show, using data simulated under a population genetics model and 1000 Genomes genotype data, that including the top principal components in a logistic regression model is sufficient to control the type 1 error rate. Our results suggest that when an EPS study is conducted, it is crucial to adjust for all confounding variables. For genetic association studies this requires genotyping a sufficient number of markers to allow for ancestry estimation. Unfortunately, this could increase the cost of a study in which sequencing or genotyping was planned only for candidate genes or pathways: in that case the available genetic data would not be suitable for ancestry correction, as many of the variants could have a true association with the trait.
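The inflation described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation code: it assumes two equally sized subpopulations that differ in both allele frequency and phenotype mean, a SNP with no true effect on the phenotype, and an unadjusted allelic z-test. The allele frequencies (0.2 and 0.4), the mean difference (0.5 standard deviations), the cohort and sample sizes, and all function names are hypothetical values chosen for illustration; they are not taken from the paper or from European population data. For the random-sample comparison, the sample is split at its median phenotype so that the same test can be applied to both designs, which is a simplification.

```python
import math
import random


def simulate_cohort(rng, n, freqs=(0.2, 0.4), means=(0.0, 0.5)):
    """Simulate (phenotype, genotype) pairs for a null SNP.

    Both the allele frequency and the phenotype mean depend on the
    (unobserved) subpopulation, so ancestry confounds the association.
    All parameter values are illustrative assumptions.
    """
    cohort = []
    for _ in range(n):
        pop = rng.randrange(2)                      # subpopulation label, 0 or 1
        p = freqs[pop]
        g = int(rng.random() < p) + int(rng.random() < p)  # genotype in {0, 1, 2}
        y = rng.gauss(means[pop], 1.0)              # phenotype; no SNP effect
        cohort.append((y, g))
    return cohort


def allelic_p_value(low, high):
    """Two-sided p-value from a two-proportion z-test on allele counts."""
    n_low, n_high = 2 * len(low), 2 * len(high)
    a_low = sum(g for _, g in low)
    a_high = sum(g for _, g in high)
    pooled = (a_low + a_high) / (n_low + n_high)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_low + 1 / n_high))
    if se == 0:
        return 1.0
    z = (a_high / n_high - a_low / n_low) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_reps=200, cohort_size=2000, sample_size=200,
                         alpha=0.05, seed=1):
    """Return (EPS rate, random-sample rate) of rejections for a null SNP."""
    rng = random.Random(seed)
    half = sample_size // 2
    eps_hits = random_hits = 0
    for _ in range(n_reps):
        cohort = sorted(simulate_cohort(rng, cohort_size))  # sorted by phenotype
        # EPS: genotype only the lowest and highest phenotypes in the cohort.
        eps_low, eps_high = cohort[:half], cohort[-half:]
        eps_hits += allelic_p_value(eps_low, eps_high) < alpha
        # Random sample of equal size, split at its own median phenotype.
        sample = sorted(rng.sample(cohort, 2 * half))
        random_hits += allelic_p_value(sample[:half], sample[half:]) < alpha
    return eps_hits / n_reps, random_hits / n_reps


if __name__ == "__main__":
    eps_rate, random_rate = false_positive_rates()
    print(f"EPS false positive rate:           {eps_rate:.3f}")
    print(f"Random-sample false positive rate: {random_rate:.3f}")
```

Under these assumed parameters, both designs reject the null more often than the nominal 5% level because the test ignores ancestry, and the EPS design does so more often because the phenotypic extremes are enriched for one subpopulation each. The sketch makes no adjustment for ancestry; the abstract reports that including the top principal components in a logistic regression model controls the type 1 error rate.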