D'Angelo Gina M, Kamboh M Ilyas, Feingold Eleanor
Division of Biostatistics, Washington University School of Medicine, St. Louis, Mo., USA.
Hum Hered. 2010;69(3):171-83. doi: 10.1159/000273732.
Missing genotype data in a candidate gene association study can make it difficult to model the effects of multiple genetic variants simultaneously. In particular, when regression models are used to model phenotype as a function of SNP genotypes in several different genes, the most common approach is a complete case analysis, in which only individuals with no missing genotypes are included. However, this can substantially reduce the sample size, leading to potential bias and loss of efficiency. A number of other methods for handling missing data are applicable but have rarely been used in this context. The purpose of this paper is to describe how several standard methods for handling missing data can be applied or adapted to this problem, and to compare their performance in a simulation study. We demonstrate these techniques using an Alzheimer's disease association study. We show that the expectation-maximization algorithm and multiple imputation with a bootstrapped expectation-maximization sampling algorithm have the best properties of all the estimators studied.
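To make the sample-size cost of complete case analysis concrete, the following is a minimal Python sketch, not the authors' simulation design. It assumes genotypes coded 0/1/2, a minor allele frequency of 0.3, eight SNPs, and a 5% missing rate applied independently to each genotype; all of these values and the function names `simulate_missing` and `complete_case_mask` are illustrative choices, not taken from the paper.

```python
import numpy as np


def simulate_missing(n_individuals, n_snps, missing_rate, seed=0):
    """Simulate 0/1/2 genotypes and blank out entries completely at random.

    The allele frequency (0.3) and the independent missingness mechanism
    are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps)).astype(float)
    missing = rng.random((n_individuals, n_snps)) < missing_rate
    genotypes[missing] = np.nan  # NaN marks a missing genotype call
    return genotypes


def complete_case_mask(genotypes):
    """Return True for individuals (rows) with no missing genotype."""
    return ~np.isnan(genotypes).any(axis=1)


genotypes = simulate_missing(n_individuals=1000, n_snps=8, missing_rate=0.05)
mask = complete_case_mask(genotypes)

# With independent missingness, each individual is complete
# with probability (1 - missing_rate) ** n_snps.
expected_fraction = (1 - 0.05) ** 8

print(f"complete cases: {mask.sum()} of {len(mask)}")
print(f"expected complete fraction: {expected_fraction:.3f}")
```

Under these assumptions, even a modest per-SNP missing rate removes roughly a third of the individuals once eight SNPs enter the same regression model, which is the loss that the alternative missing-data methods aim to avoid.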