SNP selection for association studies: maximizing power across SNP choice and study size.

Author information

Pardi F, Lewis C M, Whittaker J C

Affiliation

Department of Medical and Molecular Genetics, Guy's, King's and St. Thomas' School of Medicine, King's College London, London, UK.

Publication information

Ann Hum Genet. 2005 Nov;69(Pt 6):733-46. doi: 10.1111/j.1529-8817.2005.00202.x.

DOI:10.1111/j.1529-8817.2005.00202.x

PMID:16266411

Abstract

Selection of single nucleotide polymorphisms (SNPs) is a problem of primary importance in association studies and several approaches have been proposed. However, none provides a satisfying answer to the problem of how many SNPs should be selected, and how this should depend on the pattern of linkage disequilibrium (LD) in the region under consideration. Moreover, SNP selection is usually considered as independent from deciding the sample size of the study. However, when resources are limited there is a tradeoff between the study size and the number of SNPs to genotype. We show that tuning the SNP density to the LD pattern can be achieved by looking for the best solution to this tradeoff. Our approach consists of formulating SNP selection as an optimization problem: the objective is to maximize the power of the final association study, whilst keeping the total costs below a given budget. We also propose two alternative algorithms for the solution of this optimization problem: a genetic algorithm and a hill climbing search. These standard techniques efficiently find good solutions, even when the number of possible SNPs to choose from is large. We compare the performance of these two algorithms on different chromosomal regions and show that, as expected, the selected SNPs reflect the LD pattern: the optimal SNP density varies dramatically between chromosomal regions.
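The tradeoff described in the abstract can be illustrated with a small hill climbing sketch. This is not the authors' implementation: the abstract does not give their power calculation or cost model, so everything below is an assumption made for illustration. In particular, the cost structure (a fixed cost per subject plus a cost per genotype), the toy power function (a normal approximation with a Bonferroni correction, where a causal SNP is detected through its best-tagging selected SNP), the block-structured r² matrix, and all function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def block_r2(block_sizes, within=0.9):
    """Toy LD matrix (assumption): r^2 = `within` inside a block, 0 between blocks."""
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    m = len(labels)
    return [
        [1.0 if i == j else (within if labels[i] == labels[j] else 0.0)
         for j in range(m)]
        for i in range(m)
    ]


def toy_power(selected, r2, budget, cost_per_subject=10.0,
              cost_per_genotype=0.1, effect=0.005, alpha=0.05):
    """Average power over all candidate causal SNPs (hypothetical model).

    Genotyping more SNPs leaves less budget for subjects, so the
    affordable sample size n shrinks as the selection grows.
    """
    if not selected:
        return 0.0
    k = len(selected)
    n = int(budget // (cost_per_subject + cost_per_genotype * k))
    nd = NormalDist()
    # Two-sided test, Bonferroni-corrected for the k genotyped SNPs.
    z_crit = nd.inv_cdf(1.0 - alpha / (2.0 * k))
    total = 0.0
    for j in range(len(r2)):
        # Signal at causal SNP j is attenuated by r^2 with its best tag.
        best_r2 = max(r2[j][s] for s in selected)
        ncp = math.sqrt(n * effect * best_r2)
        total += 1.0 - nd.cdf(z_crit - ncp)
    return total / len(r2)


def hill_climb(r2, budget, **model_kwargs):
    """Greedy search: flip one SNP in or out while power strictly improves."""
    current = frozenset()
    current_power = toy_power(current, r2, budget, **model_kwargs)
    while True:
        best, best_power = current, current_power
        for i in range(len(r2)):
            neighbour = current ^ {i}  # add SNP i if absent, drop it if present
            p = toy_power(neighbour, r2, budget, **model_kwargs)
            if p > best_power + 1e-12:
                best, best_power = neighbour, p
        if best == current:  # no single flip helps: local optimum
            return sorted(current), current_power
        current, current_power = best, best_power


if __name__ == "__main__":
    r2 = block_r2([3, 3])  # two blocks of three SNPs in strong LD
    snps, power = hill_climb(r2, budget=20000.0, effect=0.005)
    print("selected SNPs:", snps, "estimated power:", round(power, 3))
```

Under these assumed parameters the search keeps one SNP per LD block: a second SNP in the same block adds little tagging information but still costs sample size and tightens the multiple-testing threshold. A region with weaker LD would need denser selection under the same model, which is the qualitative behaviour the abstract reports.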
