The Morris Kahn Laboratory of Human Genetics, Department of Virology and Developmental Genetics, NIBN, Ben Gurion University, Israel.
Bioinformatics. 2011 Oct 15;27(20):2880-7. doi: 10.1093/bioinformatics/btr486. Epub 2011 Aug 23.
High-throughput single nucleotide polymorphism (SNP) arrays have become the standard platform for linkage and association analyses. The high SNP density of these platforms allows high-resolution identification of ancestral recombination events, even for distant relatives many generations apart. However, such inference is sensitive to marker mistyping, and current error detection methods rely on genotyping additional close relatives. Genotyping algorithms provide a confidence score for each marker call, but existing methods do not use it. There is a need for a model that incorporates this prior information within standard identity by descent (IBD) and association analyses.
We propose a novel model that incorporates marker confidence scores within IBD methods based on the Lander-Green Hidden Markov Model. The novel parameter of this model is the joint distribution of confidence scores and error status per array. We estimate this probability distribution by applying a modified expectation-maximization (EM) procedure to data from nuclear families genotyped with Affymetrix 250K SNP arrays. The converged tables from two different genotyping algorithms are shown for a wide range of error rates. We demonstrate the efficacy of our method in refining the detection of IBD signals using nuclear pedigrees and distant relatives.
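As an illustration of how a joint distribution of confidence scores and error status can yield a per-marker error probability, the following minimal Python sketch conditions a joint table on the observed score bin and then uses the result to soften an HMM emission probability. The table values, the three score bins, the function names and the simple mixture error model are all assumptions made for illustration; they are not taken from the paper or from Plinke.

```python
from typing import Dict, Tuple

# Hypothetical joint table P(score_bin, is_error); entries sum to 1.
# In the paper this kind of table is estimated by EM; these numbers are made up.
JOINT: Dict[Tuple[str, bool], float] = {
    ("high", False): 0.79, ("high", True): 0.01,
    ("mid", False): 0.15, ("mid", True): 0.01,
    ("low", False): 0.03, ("low", True): 0.01,
}


def error_prob_given_score(joint: Dict[Tuple[str, bool], float],
                           score_bin: str) -> float:
    """Return P(error | score_bin) by conditioning the joint table."""
    p_err = joint[(score_bin, True)]
    p_ok = joint[(score_bin, False)]
    total = p_err + p_ok
    if total <= 0.0:
        raise ValueError(f"score bin {score_bin!r} has zero probability")
    return p_err / total


def emission_with_error(p_obs_given_state: float,
                        p_obs_background: float,
                        eps: float) -> float:
    """Mixture emission (an assumed error model, not the paper's).

    With probability 1 - eps the call is correct and follows the
    IBD-state emission; with probability eps it is treated as
    uninformative and follows a background genotype distribution.
    """
    return (1.0 - eps) * p_obs_given_state + eps * p_obs_background


if __name__ == "__main__":
    for score_bin in ("high", "mid", "low"):
        eps = error_prob_given_score(JOINT, score_bin)
        # A genotype pair that would be impossible under the IBD state
        # (probability 0) keeps a small nonzero likelihood once eps > 0.
        softened = emission_with_error(0.0, 0.2, eps)
        print(score_bin, eps, softened)
```

Under this assumed model, a low-confidence call that contradicts an IBD state penalizes that state less than a high-confidence call would, which is the intuition behind letting confidence scores inform IBD inference.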
Plinke, a new version of Plink with an extended pairwise IBD inference model that allows per-marker error probabilities, is freely available at: http://bioinfo.bgu.ac.il/bsu/software/plinke.
obirk@bgu.ac.il; markusb@bgu.ac.il
Supplementary data are available at Bioinformatics online.