Verzilli Claudio J, Stallard Nigel, Whittaker John C
Department of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, UK.
Am J Hum Genet. 2006 Jul;79(1):100-12. doi: 10.1086/505313. Epub 2006 May 30.
As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which considerably reduces the number of possible graphical structures. We develop a Markov chain Monte Carlo scheme that yields samples from the posterior distribution of graphs conditional on the data; from these samples, probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.
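The abstract gives no algorithmic detail, so the following is only a toy illustration of the general idea of sampling graph structures and reading off posterior association probabilities; it is not the authors' scheme. It assumes a binary phenotype, genotypes coded 0/1/2, a Beta-binomial marginal likelihood for the phenotype given the markers linked to it, and a crude `window` constraint standing in for the linkage-disequilibrium prior. All function names, parameters, and the simulated data (independent markers, hypothetical causal marker at index 7) are assumptions made for this sketch.

```python
import math
import random
from collections import Counter


def log_marginal(genotypes, phenotype, parents, alpha=1.0):
    """Log marginal likelihood of a binary phenotype given the markers in
    `parents`, with a Beta(alpha, alpha) prior on the case probability
    within each observed genotype configuration (assumed model)."""
    counts = Counter()
    for row, y in zip(genotypes, phenotype):
        counts[(tuple(row[j] for j in parents), y)] += 1
    total = 0.0
    for config in {c for c, _ in counts}:
        n0, n1 = counts[(config, 0)], counts[(config, 1)]
        total += (math.lgamma(2 * alpha) - math.lgamma(2 * alpha + n0 + n1)
                  + math.lgamma(alpha + n0) + math.lgamma(alpha + n1)
                  - 2 * math.lgamma(alpha))
    return total


def sample_edges(genotypes, phenotype, n_iter=3000, prior_incl=0.05,
                 window=3, seed=0):
    """Metropolis-Hastings over sets of markers linked to the phenotype.
    Returns the fraction of iterations in which each marker was linked."""
    rng = random.Random(seed)
    n_markers = len(genotypes[0])
    cache = {}

    def log_post(edges):
        # Unnormalized log posterior: marginal likelihood plus an
        # independent Bernoulli(prior_incl) prior on each edge.
        if edges not in cache:
            k = len(edges)
            cache[edges] = (log_marginal(genotypes, phenotype, sorted(edges))
                            + k * math.log(prior_incl)
                            + (n_markers - k) * math.log(1 - prior_incl))
        return cache[edges]

    current = frozenset()
    cur_lp = log_post(current)
    visits = [0] * n_markers
    for _ in range(n_iter):
        # Symmetric proposal: toggle the edge of one randomly chosen marker.
        proposal = current ^ {rng.randrange(n_markers)}
        # Toy stand-in for the LD prior: linked markers must lie within
        # `window` adjacent positions; other graphs have zero prior mass.
        if not proposal or max(proposal) - min(proposal) < window:
            prop_lp = log_post(proposal)
            if rng.random() < math.exp(min(0.0, prop_lp - cur_lp)):
                current, cur_lp = proposal, prop_lp
        for m in current:
            visits[m] += 1
    return [v / n_iter for v in visits]


def simulate(n=400, n_markers=20, causal=7, seed=1):
    """Simulated case/control data with one causal marker (hypothetical)."""
    rng = random.Random(seed)
    genotypes, phenotype = [], []
    for _ in range(n):
        row = [sum(rng.random() < 0.3 for _ in range(2))
               for _ in range(n_markers)]
        genotypes.append(row)
        phenotype.append(1 if rng.random() < 0.2 + 0.3 * row[causal] else 0)
    return genotypes, phenotype


if __name__ == "__main__":
    G, y = simulate()
    for marker, prob in enumerate(sample_edges(G, y)):
        print(marker, round(prob, 3))
```

Each returned value estimates the posterior probability that a marker is linked to the phenotype, which is the kind of probabilistic statement the abstract refers to. A realistic analysis would need correlated markers, an LD-informed prior, and a far more efficient sampler than this sketch.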