Susana Eyheramendy, Jonathan Marchini, Gilean McVean, Simon Myers, Peter Donnelly
Department of Statistics, University of Oxford, Oxford, OX1 3TG, United Kingdom.
Genome Res. 2007 Jan;17(1):88-95. doi: 10.1101/gr.5675406. Epub 2006 Nov 9.
Genome-wide association studies are still constrained by the cost of genotyping. For this reason, selecting a reduced set of markers, or tags, that captures a large proportion of the genetic variation is an important aspect of these studies. Most tagging SNP selection methods succeed in capturing the genetic variation of the data from which the tags were chosen. However, when these tags are used in an independent data set, a substantial proportion of the remaining SNPs (non-tags) are not captured and, in most cases, there is no information on which SNPs are captured. We propose using a probabilistic model to predict the non-tags from a set of tags as a way to capture genetic variation. An important advantage of this method is that it directly predicts the genotypes of the non-tags. These predicted genotypes can be tested for association with the phenotype, which could help to locate genes that increase disease susceptibility. Additionally, the method provides an estimate of the probability with which each prediction is made, reflecting the confidence of the probabilistic model. We also propose new methods to select the tagging SNPs. Using HapMap data, we show empirically that our approach captures significantly more genetic variation than methods based solely on a pairwise LD measure.
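To make the idea of predicting a non-tag genotype together with a confidence value concrete, here is a minimal Python sketch. It is not the probabilistic model used in the paper, which the abstract does not specify. It assumes genotypes coded as 0, 1, or 2 copies of the minor allele, a single tag per non-tag SNP, and a smoothed conditional frequency table estimated from a reference panel. The names `fit_conditional`, `predict`, and `pseudocount` are hypothetical.

```python
from collections import Counter, defaultdict

GENOTYPES = (0, 1, 2)  # assumed coding: copies of the minor allele


def fit_conditional(tag_genotypes, target_genotypes, pseudocount=0.5):
    """Estimate P(non-tag genotype | tag genotype) from a reference panel.

    Illustrative simplification: one tag per non-tag SNP, with additive
    smoothing so that unseen genotype combinations keep nonzero probability.
    """
    counts = defaultdict(Counter)
    for tag, target in zip(tag_genotypes, target_genotypes):
        counts[tag][target] += 1

    table = {}
    for tag in GENOTYPES:
        total = sum(counts[tag].values()) + pseudocount * len(GENOTYPES)
        table[tag] = {
            g: (counts[tag][g] + pseudocount) / total for g in GENOTYPES
        }
    return table


def predict(table, tag_genotype):
    """Return the most probable non-tag genotype and its estimated probability."""
    probs = table[tag_genotype]
    best = max(probs, key=probs.get)
    return best, probs[best]


if __name__ == "__main__":
    # Toy reference panel (made-up values, for illustration only).
    panel_tag = [0, 0, 0, 1, 1, 1, 2, 2]
    panel_target = [0, 0, 0, 1, 1, 2, 2, 2]

    table = fit_conditional(panel_tag, panel_target)
    genotype, confidence = predict(table, tag_genotype=1)
    print(f"predicted genotype: {genotype}, probability: {confidence:.3f}")
```

The returned probability plays the role of the confidence estimate described above: predictions with low probability can be down-weighted or excluded before testing for association. A realistic model would condition on several tags jointly rather than on one.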