Kang Hosung, Qin Zhaohui S, Niu Tianhua, Liu Jun S
Department of Statistics, Harvard University, Cambridge, MA 02138, USA.
Am J Hum Genet. 2004 Mar;74(3):495-510. doi: 10.1086/382284. Epub 2004 Feb 13.
The accuracy of the vast amount of genotypic information generated by high-throughput genotyping technologies is crucial in haplotype analyses and linkage-disequilibrium mapping for complex diseases. To date, most automated programs lack quality measures for the allele calls; therefore, manual intervention, which is both labor intensive and error prone, is still required. Here, we propose a novel genotype clustering algorithm, GeneScore, based on a bivariate t-mixture model, which assigns to each data point a set of probabilities of belonging to each of the candidate genotype clusters. Furthermore, we describe an expectation-maximization (EM) algorithm for haplotype phasing, GenoSpectrum (GS)-EM, which can use probabilistic multilocus genotype matrices (called "GenoSpectrum") as inputs. By combining these two model-based algorithms, we can perform haplotype inference directly on raw readouts from a genotyping platform, such as the TaqMan assay. Using both simulated and real data sets, we demonstrate the advantages of our probabilistic approach over current genotype scoring methods, in terms of both the accuracy of haplotype inference and the statistical power of haplotype-based association analyses.
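To illustrate what assigning a set of probabilities to each data point can look like under a bivariate t-mixture, the following is a minimal sketch and not the GeneScore implementation. It assumes the mixture parameters are already known, whereas a full method would estimate them from the data. The function names, the cluster means and covariances, the degrees of freedom, and the two-channel intensity values are all hypothetical and chosen only for illustration.

```python
import math

def bivariate_t_pdf(x, mean, cov, df):
    """Density of a bivariate Student-t distribution at point x = (x1, x2)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    # In two dimensions Gamma((df + 2) / 2) / Gamma(df / 2) = df / 2,
    # so the normalizing constant reduces to 1 / (2 * pi * sqrt(det)).
    return (1.0 + maha / df) ** (-(df + 2) / 2.0) / (2.0 * math.pi * math.sqrt(det))

def genotype_posteriors(x, weights, means, covs, df):
    """Posterior probability that x belongs to each mixture component."""
    dens = [w * bivariate_t_pdf(x, m, s, df)
            for w, m, s in zip(weights, means, covs)]
    total = sum(dens)
    return [v / total for v in dens]

# Hypothetical clusters for genotypes AA, AB, BB in a two-channel intensity plane.
weights = [1 / 3, 1 / 3, 1 / 3]
means = [(1.0, 0.1), (0.6, 0.6), (0.1, 1.0)]
covs = [((0.01, 0.0), (0.0, 0.01))] * 3
df = 4.0  # assumed degrees of freedom

clear_call = genotype_posteriors((0.95, 0.12), weights, means, covs, df)
ambiguous = genotype_posteriors((0.80, 0.35), weights, means, covs, df)
print([round(p, 3) for p in clear_call])
print([round(p, 3) for p in ambiguous])
```

In this sketch a point close to one cluster center receives nearly all of its probability from that cluster, while a point midway between two clusters splits its probability between them. A probability vector of this kind, rather than a single hard call, is the sort of per-locus input that a probabilistic phasing step could consume.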