Long J C, Williams R C, Urbanek M
Laboratory of Neurogenetics, NIAAA/NIH, Rockville, MD 20852.
Am J Hum Genet. 1995 Mar;56(3):799-810.
This paper gives an expectation-maximization (EM) algorithm for estimating allele frequencies, haplotype frequencies, and gametic disequilibrium coefficients in multiple-locus systems. It permits high polymorphism and null alleles at all loci. The approach deals effectively with the two primary estimation problems associated with such systems: there is no one-to-one correspondence between phenotypic and genotypic categories, and sample sizes tend to be much smaller than the number of phenotypic categories. The EM method provides maximum-likelihood estimates and therefore allows hypothesis tests using likelihood-ratio statistics, which have chi-square distributions when sample sizes are large. We also suggest a data-resampling approach to estimate the sampling distributions of the test statistics. The resampling approach is more computer intensive, but it is applicable to all sample sizes. A strategy for testing hypotheses about aggregate groups of gametic disequilibrium coefficients is recommended. This strategy minimizes the number of necessary hypothesis tests while still describing the structure of disequilibrium. These methods are applied to three unlinked dinucleotide repeat loci in Navajo Indians and to three linked HLA loci in Gila River (Pima) Indians. The likelihood functions of both data sets are shown to be maximized by the EM estimates, and the testing strategy provides a useful description of the structure of gametic disequilibrium. Following these applications, a number of simulation experiments are performed to test how well the distributions of the likelihood-ratio statistics are approximated by chi-square distributions. In most circumstances the chi-square approximation grossly underestimated the probability of type I errors. However, at times it also overestimated the type I error probability. Accordingly, we recommend hypothesis tests that use the resampling method.
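As an illustration of the EM idea described above, the following is a minimal sketch for the simplest textbook case only: two biallelic, codominant loci with no null alleles. It is not the multiple-locus algorithm of the paper. The function name `em_haplotype_freqs`, the 3x3 genotype-count layout, and the allele labels `A/a` and `B/b` are assumptions made for this example. The only ambiguous phenotype here is the double heterozygote, whose phase (AB/ab versus Ab/aB) is apportioned in the E-step according to the current haplotype frequency estimates.

```python
def em_haplotype_freqs(counts, max_iter=500, tol=1e-12):
    """EM estimates of two-locus haplotype frequencies (illustrative sketch).

    counts[i][j] = number of individuals carrying i copies of allele A
    at locus 1 and j copies of allele B at locus 2 (i, j in 0..2).
    Returns (freqs, D) where freqs maps "AB", "Ab", "aB", "ab" to
    estimated frequencies and D = p(AB) - p(A) * p(B).
    """
    c = counts
    n_individuals = sum(sum(row) for row in c)
    total = 2 * n_individuals  # number of sampled gametes

    # Haplotype counts contributed by phase-unambiguous genotypes.
    base = {
        "AB": 2 * c[2][2] + c[2][1] + c[1][2],
        "Ab": 2 * c[2][0] + c[2][1] + c[1][0],
        "aB": 2 * c[0][2] + c[1][2] + c[0][1],
        "ab": 2 * c[0][0] + c[1][0] + c[0][1],
    }
    double_het = c[1][1]  # phase unknown: AB/ab or Ab/aB

    # Start from equal haplotype frequencies.
    p = {h: 0.25 for h in base}

    for _ in range(max_iter):
        # E-step: expected share of double heterozygotes in AB/ab phase.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        w = cis / (cis + trans) if (cis + trans) > 0 else 0.5

        # M-step: re-estimate frequencies from expected haplotype counts.
        new_p = {
            "AB": (base["AB"] + w * double_het) / total,
            "ab": (base["ab"] + w * double_het) / total,
            "Ab": (base["Ab"] + (1 - w) * double_het) / total,
            "aB": (base["aB"] + (1 - w) * double_het) / total,
        }
        change = max(abs(new_p[h] - p[h]) for h in p)
        p = new_p
        if change < tol:
            break

    # Gametic disequilibrium coefficient for the A-B allele pair.
    p_A = p["AB"] + p["Ab"]
    p_B = p["AB"] + p["aB"]
    D = p["AB"] - p_A * p_B
    return p, D
```

Calling `em_haplotype_freqs(counts)` on a 3x3 table of genotype counts returns the estimated haplotype frequencies and the disequilibrium coefficient. Extending the sketch to many alleles, null alleles, and more than two loci requires enumerating every genotype compatible with each phenotype in the E-step, which is the situation the abstract addresses.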