Wang Miaoyan, Jakobsdottir Johanna, Smith Albert V, McPeek Mary Sara
Department of Statistics, University of Chicago, Chicago, Illinois, United States of America.
Icelandic Heart Association, Kopavogur, Iceland.
Genet Epidemiol. 2016 Sep;40(6):446-60. doi: 10.1002/gepi.21982. Epub 2016 Jun 3.
In a large-scale genetic association study, the number of phenotyped individuals available for sequencing may, in some cases, be greater than the study's sequencing budget will allow. In that case, it can be important to prioritize individuals for sequencing in a way that optimizes power for association with the trait. Suppose a cohort of phenotyped individuals is available, with some subset of them possibly already sequenced, and one wants to choose an additional fixed-size subset of individuals to sequence in such a way that the power to detect association is maximized. When the phenotyped sample includes related individuals, power for association can be gained by including partial information, such as phenotype data of ungenotyped relatives, in the analysis, and this should be taken into account when assessing whom to sequence. We propose G-STRATEGY, which uses simulated annealing to choose a subset of individuals for sequencing that maximizes the expected power for association. In simulations, G-STRATEGY performs extremely well for a range of complex disease models and outperforms other strategies with, in many cases, relative power increases of 20-40% over the next best strategy, while maintaining correct type 1 error. G-STRATEGY is computationally feasible even for large datasets and complex pedigrees. We apply G-STRATEGY to data on high-density lipoprotein and low-density lipoprotein from the AGES-Reykjavik and REFINE-Reykjavik studies, in which G-STRATEGY is able to closely approximate the power of sequencing the full sample by selecting for sequencing only a small subset of the individuals.
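The abstract says only that G-STRATEGY uses simulated annealing to pick a fixed-size subset of individuals that maximizes expected power. As a rough illustration of that general idea, and not of the authors' implementation, the following Python sketch anneals over size-k subsets using swap moves. The function name `anneal_subset`, the geometric cooling schedule, and the toy additive objective are all assumptions introduced here; in particular, the `score` argument is a stand-in and is not the expected-power criterion of the paper.

```python
import math
import random


def anneal_subset(candidates, k, score, n_iter=5000,
                  t_start=1.0, t_end=1e-3, seed=0):
    """Search for a size-k subset of `candidates` with a high `score`.

    `score` is a placeholder objective (higher is better). It is an
    assumption for illustration, NOT the G-STRATEGY power criterion.
    """
    rng = random.Random(seed)
    candidates = list(candidates)
    if not 0 < k < len(candidates):
        raise ValueError("k must be between 1 and len(candidates) - 1")

    # Start from a random subset; `rest` holds the unselected individuals.
    chosen = rng.sample(candidates, k)
    chosen_set = set(chosen)
    rest = [c for c in candidates if c not in chosen_set]

    current = score(chosen)
    best, best_score = list(chosen), current

    for step in range(n_iter):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        frac = step / max(n_iter - 1, 1)
        temp = t_start * (t_end / t_start) ** frac

        # Propose swapping one selected individual with one unselected one,
        # which keeps the subset size fixed at k.
        i = rng.randrange(len(chosen))
        j = rng.randrange(len(rest))
        chosen[i], rest[j] = rest[j], chosen[i]

        proposed = score(chosen)
        delta = proposed - current
        # Metropolis rule: always accept improvements, sometimes accept
        # worse subsets, with probability shrinking as temp falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposed
            if current > best_score:
                best, best_score = list(chosen), current
        else:
            chosen[i], rest[j] = rest[j], chosen[i]  # undo the swap

    return sorted(best), best_score


# Toy usage: each hypothetical individual has a made-up "informativeness"
# weight, and the placeholder objective simply adds the weights up.
toy_weights = dict(enumerate([3, 9, 1, 7, 5, 11, 2, 8, 4, 10, 6, 12]))


def toy_score(subset):
    return sum(toy_weights[i] for i in subset)


if __name__ == "__main__":
    subset, value = anneal_subset(list(toy_weights), 4, toy_score)
    print(subset, value)
```

An additive objective like `toy_score` could be maximized by simple sorting, so it serves only to exercise the search loop. Annealing becomes worthwhile when the value of sequencing one individual depends on who else is selected, as the abstract indicates is the case when relatives contribute partial information.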