The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
Genome Res. 2011 Jul;21(7):1168-80. doi: 10.1101/gr.115360.110. Epub 2011 Apr 14.
All individuals in a finite population are related if traced back far enough and will therefore share regions of their genomes identical by descent (IBD). Detecting such regions has several important applications, from answering questions about human evolution to locating regions of the human genome that contain disease-causing variants. However, IBD regions can be difficult to detect, especially in the common case where no pedigree information is available. In particular, all existing non-pedigree-based methods can only infer IBD sharing between two individuals. Here, we present a new Markov chain Monte Carlo method for detecting IBD regions that does not rely on any pedigree information. It is based on a probabilistic model applicable to unphased SNP data, and it can take inbreeding, allele frequencies, genotyping errors, and genomic distances into account. Most importantly, it can simultaneously infer IBD sharing among multiple individuals. Through simulations, we show that simultaneously modeling multiple individuals makes the method more powerful and accurate than several other non-pedigree-based methods. We illustrate the potential of the method by applying it to data from individuals with breast and/or ovarian cancer, and show that a known disease-causing mutation can be mapped to a 2.2-Mb region using SNP data from only five seemingly unrelated affected individuals. This would not be possible using classical linkage mapping or association mapping.
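To make concrete how allele frequencies and genotyping errors can enter a probabilistic model for unphased SNP data, the following is a minimal sketch of a single-SNP likelihood for the simpler two-individual case. It is not the model presented in the paper, which handles multiple individuals, inbreeding, and genomic distances. Everything here is an illustrative assumption: the function names (`true_genotype_probs`, `error_prob`, `emission`), the biallelic SNP with Hardy-Weinberg founder alleles, the restriction to two non-inbred individuals with IBD state `z` in {0, 1, 2}, and the error model in which a genotype is replaced by a uniformly drawn genotype with probability `eps`.

```python
from itertools import product

# Founder-allele labels carried by each individual under pairwise IBD state z
# (z = number of alleles shared IBD by two non-inbred individuals).
# Assumption: this labelling is an illustrative device, not the paper's notation.
IBD_PATTERNS = {
    0: ((0, 1), (2, 3)),  # no founder allele shared
    1: ((0, 1), (0, 2)),  # one founder allele shared
    2: ((0, 1), (0, 1)),  # both founder alleles shared
}


def true_genotype_probs(z, p):
    """Return P(g1, g2 | z), where g counts copies of an allele with frequency p."""
    ind1, ind2 = IBD_PATTERNS[z]
    n_founders = max(ind1 + ind2) + 1
    probs = {(g1, g2): 0.0 for g1 in range(3) for g2 in range(3)}
    # Enumerate every assignment of allelic types to the distinct founder alleles.
    for alleles in product((0, 1), repeat=n_founders):
        weight = 1.0
        for a in alleles:
            weight *= p if a == 1 else 1.0 - p
        g1 = alleles[ind1[0]] + alleles[ind1[1]]
        g2 = alleles[ind2[0]] + alleles[ind2[1]]
        probs[(g1, g2)] += weight
    return probs


def error_prob(observed, true, eps):
    """Assumed error model: with probability eps the genotype is redrawn uniformly."""
    return (1.0 - eps) * (observed == true) + eps / 3.0


def emission(o1, o2, z, p, eps=0.01):
    """Likelihood of the observed unphased genotypes (o1, o2) given IBD state z."""
    total = 0.0
    for (g1, g2), prob in true_genotype_probs(z, p).items():
        total += prob * error_prob(o1, g1, eps) * error_prob(o2, g2, eps)
    return total
```

For example, with `p = 0.5` and no genotyping error, `emission(2, 2, 1, 0.5, eps=0.0)` evaluates to `0.125`, which is the probability that the shared allele and both unshared alleles are all of the counted type. Opposite homozygotes, `emission(0, 2, 1, 0.5, eps=0.0)`, have probability zero under IBD1, and a nonzero `eps` is what keeps a single genotyping error from ruling out an IBD region entirely.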