Zhu Wensheng, Kuk Anthony Y C, Guo Jianhua
Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, P. R. China.
Biom J. 2009 Aug;51(4):644-58. doi: 10.1002/bimj.200800215.
Haplotype inference is important in genetic epidemiology studies. However, all large genotype data sets contain errors, owing to inexpensive but fallible genotyping machines and shortcomings in genotype-scoring software, and these errors can have an enormous impact on haplotype inference. In this article, we propose two novel strategies to reduce the impact of genotyping errors on haplotype inference. The first method makes use of double sampling: for each individual, a "GenoSpectrum" consisting of all possible genotypes and their corresponding likelihoods is computed. The second method is a genotype clustering algorithm based on multi-genotyping data, which also assigns a "GenoSpectrum" to each individual. We then describe two hybrid EM algorithms (called DS-EM and MG-EM) that perform haplotype inference based on the "GenoSpectrum" of each individual obtained from double sampling and multi-genotyping data, respectively. Results on both simulated data sets and a quasi-real data set demonstrate that the proposed methods perform well in different situations and outperform the conventional EM algorithm and the HMM algorithm proposed by Sun, Greenwood, and Neal (2007, Genetic Epidemiology 31, 937-948) when the genotype data sets contain errors.
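To make the idea of running EM over a per-individual "GenoSpectrum" concrete, the following is a minimal Python sketch. It is not the DS-EM or MG-EM algorithm from the article, whose details the abstract does not give. It assumes biallelic SNPs coded as allele counts 0/1/2, Hardy-Weinberg equilibrium, and that each individual's spectrum is already supplied as a list of (genotype, positive weight) pairs. The function names `compatible_pairs` and `em_haplotype_freqs` are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return ordered haplotype pairs (h1, h2) whose alleles sum to the genotype.

    Assumption: each site is coded as the count (0, 1 or 2) of allele 1.
    """
    per_site = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    pairs = []
    for combo in product(*(per_site[g] for g in genotype)):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(spectra, n_sites, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM from weighted candidate genotypes.

    spectra: one list per individual of (genotype, weight) pairs, where the
    weight plays the role of the likelihood of that candidate genotype.
    """
    haps = list(product((0, 1), repeat=n_sites))
    freqs = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for spectrum in spectra:
            # E-step: joint weight of every (candidate genotype, haplotype pair).
            terms = []
            for genotype, weight in spectrum:
                for h1, h2 in compatible_pairs(genotype):
                    terms.append((h1, h2, weight * freqs[h1] * freqs[h2]))
            total = sum(t for _, _, t in terms)
            if total == 0.0:
                continue  # individual carries no information under current freqs
            for h1, h2, t in terms:
                counts[h1] += t / total
                counts[h2] += t / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chrom = sum(counts.values())
        new_freqs = {h: c / n_chrom for h, c in counts.items()}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs


if __name__ == "__main__":
    # Two individuals with certain genotypes, one with an uncertain second call.
    spectra = [
        [((0, 0), 1.0)],
        [((2, 2), 1.0)],
        [((1, 1), 0.9), ((1, 0), 0.1)],
    ]
    for hap, freq in sorted(em_haplotype_freqs(spectra, n_sites=2).items()):
        print(hap, round(freq, 4))
```

A conventional EM corresponds to the special case in which every spectrum holds a single genotype with weight 1. Allowing several weighted candidates lets an uncertain call contribute to the haplotype counts in proportion to its likelihood rather than being treated as exact.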