Liu Nianjun, Beerman Isabel, Lifton Richard, Zhao Hongyu
Department of Biostatistics, University of Alabama at Birmingham, Birmingham, USA.
Genet Epidemiol. 2006 May;30(4):290-300. doi: 10.1002/gepi.20144.
Missing genotypes are common in practical genetic studies, but the exact underlying missing data mechanism is generally unknown to investigators. Although some statistical methods can handle missing data, including methods for haplotype frequency estimation and haplotype association analysis, they usually assume that genotypes are missing at random, that is, that at a given marker, different genotypes and different alleles are missing with the same probability. This simple assumption is unlikely to hold in practice, yet few studies to date have examined the magnitude of the effects when it is violated. In this study, we demonstrate that violating this assumption may lead to serious bias in haplotype frequency estimates, and that haplotype association analysis based on it can produce both false-positive and false-negative evidence of association. To address this limitation of current methods, we propose a general missing data model that characterizes missing data patterns across a set of two or more markers simultaneously. We prove that, under our general missing data model, haplotype frequencies and missing data probabilities are identifiable if and only if there is linkage disequilibrium between these markers. Simulation studies on the analysis of haplotypes consisting of two single nucleotide polymorphisms illustrate that our proposed model can reduce the bias, in both haplotype frequency estimates and association analysis, that arises from an incorrect assumption about the missing data mechanism. Finally, we illustrate the utility of our method through its application to a real data set.
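As a rough illustration of why genotype-dependent missingness matters, the following sketch computes the expected complete-case allele frequency estimate at a single biallelic marker. It is not the model proposed in the paper: the function name `observed_allele_freq`, the Hardy-Weinberg assumption, the true frequency of 0.3, and the missingness probabilities are all hypothetical choices made for this example.

```python
def observed_allele_freq(p, miss):
    """Expected complete-case estimate of the frequency of allele A.

    p    -- true frequency of allele A (assumed value, for illustration)
    miss -- dict mapping genotype ("AA", "Aa", "aa") to its probability
            of being missing (assumed values, for illustration)
    """
    # Genotype frequencies under Hardy-Weinberg equilibrium (an assumption).
    geno = {"AA": p * p, "Aa": 2 * p * (1 - p), "aa": (1 - p) ** 2}
    # Probability that each genotype is both present and observed.
    kept = {g: f * (1 - miss[g]) for g, f in geno.items()}
    total = sum(kept.values())
    # Allele frequency among observed genotypes: AA carries two copies
    # of A, Aa carries one, so Aa contributes with weight one half.
    return (kept["AA"] + 0.5 * kept["Aa"]) / total


p = 0.3
# Every genotype is missing with the same probability: no bias expected.
equal = observed_allele_freq(p, {"AA": 0.1, "Aa": 0.1, "aa": 0.1})
# Only heterozygotes go missing: the estimate drifts away from p.
unequal = observed_allele_freq(p, {"AA": 0.0, "Aa": 0.3, "aa": 0.0})
print(equal, unequal)
```

With equal missingness the estimate recovers the true value of 0.3, whereas losing 30% of heterozygotes pulls it down to roughly 0.27. The same distortion carries over to haplotype frequencies built from such markers, which is the kind of bias the abstract describes.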