Liang Wei E, Thomas Duncan C, Conti David V
Department of Preventive Medicine, University of Southern California, Los Angeles, California.
Genet Epidemiol. 2012 Dec;36(8):870-81. doi: 10.1002/gepi.21681. Epub 2012 Sep 12.
With its potential to discover a much greater amount of genetic variation, next-generation sequencing is rapidly emerging as a tool for genetic association studies. However, sequencing every individual in a large-scale population study still costs more than most alternative genotyping options. Although individual-level data cannot be recovered without bar-coding, sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for sequencing error and for the variation in read depth across pools. To investigate the performance of our method across different numbers of pools, numbers of individuals within each pool, and levels of average coverage, we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates. In general, the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency, and sequencing error rate.
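The kind of pooled data described above can be illustrated with a small simulation. The Python sketch below is not the authors' hierarchical Bayesian model and does not use the hiPOD package. It assumes a simple generative process chosen for illustration: binomial sampling of minor alleles into each pool, Poisson-distributed read depth, and a symmetric per-read sequencing error. All function names and parameter values are hypothetical.

```python
import math
import random


def draw_binomial(rng, n, p):
    """Binomial draw as a sum of n Bernoulli trials (adequate for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def draw_poisson(rng, lam):
    """Poisson draw via Knuth's multiplication method (for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def observed_alt_fraction(pool_freq, error_rate):
    """Chance that a read shows the minor allele (assumed symmetric error)."""
    return pool_freq * (1.0 - error_rate) + (1.0 - pool_freq) * error_rate


def simulate_pool(rng, maf, n_individuals, mean_depth, error_rate):
    """Simulate read counts at one variant for one pool of diploid individuals."""
    n_chrom = 2 * n_individuals
    pool_freq = draw_binomial(rng, n_chrom, maf) / n_chrom
    depth = draw_poisson(rng, mean_depth)  # read depth varies across pools
    alt_reads = draw_binomial(
        rng, depth, observed_alt_fraction(pool_freq, error_rate)
    )
    return {"depth": depth, "alt_reads": alt_reads, "pool_freq": pool_freq}


def simulate_group(rng, maf, n_pools, n_individuals, mean_depth, error_rate):
    """Simulate all pools for one group (cases or controls)."""
    return [
        simulate_pool(rng, maf, n_individuals, mean_depth, error_rate)
        for _ in range(n_pools)
    ]


def naive_frequency(pools, error_rate):
    """Depth-weighted allele frequency, corrected for the assumed error rate."""
    total_depth = sum(p["depth"] for p in pools)
    if total_depth == 0:
        raise ValueError("no reads observed in any pool")
    raw = sum(p["alt_reads"] for p in pools) / total_depth
    corrected = (raw - error_rate) / (1.0 - 2.0 * error_rate)
    return min(1.0, max(0.0, corrected))


if __name__ == "__main__":
    rng = random.Random(2012)
    # Hypothetical design: 10 pools per group, 20 individuals per pool, 30x depth.
    controls = simulate_group(rng, 0.05, 10, 20, 30, 0.01)
    cases = simulate_group(rng, 0.08, 10, 20, 30, 0.01)
    print("control frequency estimate:", naive_frequency(controls, 0.01))
    print("case frequency estimate:", naive_frequency(cases, 0.01))
```

Repeating such a simulation while varying the number of pools, the pool size, and the mean depth gives a rough sense of how each design choice affects the precision of the case and control frequency estimates. A full power analysis would require a proper association test in place of the naive estimator used here.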