Direct analysis of unphased SNP genotype data in population-based association studies via Bayesian partition modelling of haplotypes.

Author information

Morris Andrew P

Affiliation

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom.

Publication information

Genet Epidemiol. 2005 Sep;29(2):91-107. doi: 10.1002/gepi.20080.

DOI:10.1002/gepi.20080

PMID:15940704

Abstract

We describe a novel method for assessing the strength of disease association with single nucleotide polymorphisms (SNPs) in a candidate gene or small candidate region, and for estimating the corresponding haplotype relative risks of disease, using unphased genotype data directly. We begin by estimating the relative frequencies of haplotypes consistent with observed SNP genotypes. Under the Bayesian partition model, we specify cluster centres from this set of consistent SNP haplotypes. The remaining haplotypes are then assigned to the cluster with the "nearest" centre, where distance is defined in terms of SNP allele matches. Within a logistic regression modelling framework, each haplotype within a cluster is assigned the same disease risk, reducing the number of parameters required. Uncertainty in phase assignment is addressed by considering all possible haplotype configurations consistent with each unphased genotype, weighted in the logistic regression likelihood by their probabilities, calculated according to the estimated relative haplotype frequencies. We develop a Markov chain Monte Carlo algorithm to sample over the space of haplotype clusters and corresponding disease risks, allowing for covariates that might include environmental risk factors or polygenic effects. Application of the algorithm to SNP genotype data in an 890-kb region flanking the CYP2D6 gene illustrates that we can identify clusters of haplotypes with similar risk of poor drug metaboliser (PDM) phenotype, and can distinguish PDM cases carrying different high-risk variants. Further, the results of a detailed simulation study suggest that we can identify positive evidence of association for moderate relative disease risks with a sample of 1,000 cases and 1,000 controls.
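The phase-weighting and clustering steps summarised above can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made for illustration only: haplotypes are coded as tuples of 0/1 alleles, phase probabilities are computed from haplotype frequencies under Hardy-Weinberg equilibrium, the "nearest" centre is the one with the highest plain count of matching alleles (ties go to the first centre), and the two haplotypes of an individual contribute additively on the log-odds scale. All function names and parameter values are hypothetical, and the MCMC sampling over clusters and risks is not shown.

```python
from itertools import product
from math import exp


def consistent_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of allele counts (0, 1 or 2) per SNP; 1 marks a heterozygote.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites are fixed; het sites filled below
    pairs = set()
    for alleles in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, a in zip(het, alleles):
            h1[i], h2[i] = a, 1 - a
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def phase_probabilities(genotype, freq):
    """Probability of each consistent pair given haplotype frequencies.

    Assumes Hardy-Weinberg equilibrium (an illustrative assumption).
    """
    weights = {}
    for h1, h2 in consistent_pairs(genotype):
        w = freq.get(h1, 0.0) * freq.get(h2, 0.0) * (1 if h1 == h2 else 2)
        if w > 0:
            weights[(h1, h2)] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no consistent haplotype pair has positive frequency")
    return {pair: w / total for pair, w in weights.items()}


def nearest_centre(hap, centres):
    """Index of the centre sharing the most alleles with hap (ties: lowest index)."""
    matches = [sum(a == b for a, b in zip(hap, c)) for c in centres]
    return matches.index(max(matches))


def case_probability(genotype, freq, centres, beta0, beta):
    """Phase-averaged case probability under an additive logistic model.

    beta[k] is the log-odds effect shared by every haplotype in cluster k.
    """
    prob = 0.0
    for (h1, h2), w in phase_probabilities(genotype, freq).items():
        eta = beta0 + beta[nearest_centre(h1, centres)] + beta[nearest_centre(h2, centres)]
        prob += w / (1.0 + exp(-eta))
    return prob


# Hypothetical two-SNP example: a double heterozygote has two possible phasings.
freq = {(0, 0): 0.5, (1, 1): 0.3, (0, 1): 0.1, (1, 0): 0.1}
centres = [(0, 0), (1, 1)]
p_case = case_probability((1, 1), freq, centres, beta0=-1.0, beta=[0.0, 1.0])
```

Because every haplotype in a cluster shares one coefficient, the number of risk parameters grows with the number of clusters rather than with the number of distinct haplotypes, which is the parameter reduction the abstract describes.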
