Jonathan M. Keith, Allan McRae, David Duffy, Kerrie Mengersen, Peter M. Visscher
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Qld. 4001, Australia.
Genet Epidemiol. 2008 Sep;32(6):513-9. doi: 10.1002/gepi.20324.
The probabilities that two individuals share 0, 1, or 2 alleles identical by descent (IBD) at a given genotyped marker locus are quantities of fundamental importance for disease gene and quantitative trait mapping and in family-based tests of association. Until recently, genotyped markers were sufficiently sparse that founder haplotypes could be modelled as having been drawn from a population in linkage equilibrium for the purpose of estimating IBD probabilities. However, with the advent of high-throughput single nucleotide polymorphism genotyping assays, this is no longer a reasonable assumption. Indeed, the imminent arrival of individual sequencing will enable high-density single nucleotide polymorphism genotyping on a scale for which current algorithms are not equipped. In this paper, we present a simple new model in which founder haplotypes are modelled as a Markov chain. Another important innovation is that genotyping errors are explicitly incorporated into the model. We compare results obtained using the new model to those obtained using the popular genetic linkage analysis package Merlin, with and without the cluster model of linkage disequilibrium that is incorporated into that program. We find that the new model achieves accuracy approaching that of Merlin with haplotype blocks, with run times that are orders of magnitude shorter. Moreover, the run time of the new algorithm scales linearly with the number of markers, irrespective of density, whereas that of Merlin scales supralinearly. We also confirm a previous finding that ignoring linkage disequilibrium in founder haplotypes can cause errors in the calculation of IBD probabilities.
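As a rough illustration of the idea of modelling a founder haplotype as a Markov chain, the sketch below computes the log-probability of a biallelic haplotype when each allele depends only on the allele at the preceding marker, and includes a simple symmetric genotyping-error term. It is a minimal sketch, not the model or parameterisation used in the paper: the function names, the 0/1 allele coding, the transition tables, and the single `error_rate` parameter are all assumptions made for this example.

```python
import math


def haplotype_log_prob(haplotype, initial, transitions):
    """Log-probability of a 0/1 allele sequence under a first-order Markov chain.

    haplotype:   list of alleles coded 0 or 1, one per marker (assumed coding).
    initial:     [P(allele 0), P(allele 1)] at the first marker.
    transitions: one 2x2 table per adjacent marker pair, where
                 transitions[k][a][b] = P(allele b at marker k+1 | allele a at marker k).
    """
    logp = math.log(initial[haplotype[0]])
    # One table lookup per marker, so the cost grows linearly with marker count.
    for k in range(1, len(haplotype)):
        prev_allele, allele = haplotype[k - 1], haplotype[k]
        logp += math.log(transitions[k - 1][prev_allele][allele])
    return logp


def observed_allele_prob(observed, true_allele, error_rate):
    """P(observed allele | true allele) under a symmetric error model (assumed)."""
    return 1.0 - error_rate if observed == true_allele else error_rate


# Linkage equilibrium: both rows of the table are equal, so the second
# allele does not depend on the first and the chain reduces to a product
# of allele frequencies.
equilibrium = [[[0.7, 0.3], [0.7, 0.3]]]

# Linkage disequilibrium: the rows differ, so neighbouring alleles are correlated.
disequilibrium = [[[0.9, 0.1], [0.2, 0.8]]]

initial = [0.6, 0.4]
p_le = math.exp(haplotype_log_prob([0, 1], initial, equilibrium))
p_ld = math.exp(haplotype_log_prob([0, 1], initial, disequilibrium))
```

With these illustrative numbers the same haplotype receives a different probability under the two tables, which is the kind of discrepancy that arises when linkage disequilibrium between dense markers is ignored.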