Mallick Himel, Tiwari Hemant K
Department of Biostatistics, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA, USA; Program of Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, USA.
Front Genet. 2016 Mar 30;7:32. doi: 10.3389/fgene.2016.00032. eCollection 2016.
Count data are increasingly common in genetic association studies, where the observed number of zero counts often exceeds what standard distributional assumptions predict. For instance, in rheumatology, data are usually collected from multiple joints within a person or from multiple sub-regions of a joint, and it is not uncommon for the phenotypes to contain a very large number of zeros because of excess zero counts in the majority of patients. Most existing statistical methods assume that the count phenotypes follow one of four distributions, with appropriate dispersion-handling mechanisms: Poisson, zero-inflated Poisson (ZIP), negative binomial, or zero-inflated negative binomial (ZINB). However, little is known about their implications in genetic association studies, and there is a relative paucity of literature on their usefulness with respect to model misspecification and variable selection. In this article, we investigate the performance of several state-of-the-art approaches for handling zero-inflated count data, along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By incorporating data-adaptive weights into the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented to estimate the model parameters and perform variable selection simultaneously. Results show that the proposed method has optimal performance in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, and this advantage is especially apparent as the sample size increases. Moreover, the Type I error rates of the competing methods become more or less uncontrollable when the model is misspecified, a phenomenon routinely encountered in practice.
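To make the notion of "excess zeros" concrete, the following minimal sketch shows the standard zero-inflated Poisson (ZIP) mixture, in which a count is a structural zero with probability pi and is otherwise drawn from a Poisson distribution with mean lam. It is an illustration only and is not the authors' implementation: the function names, the parameter values (lam = 2.0, pi = 0.6), and the sample size are all assumptions chosen for the example, and it does not cover the penalized regression or the EM algorithm described above.

```python
import math
import random


def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """P(Y = k) under a zero-inflated Poisson mixture.

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also produce a zero.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1


def sample_zip(n, lam, pi, seed=0):
    """Draw n zero-inflated Poisson counts (hypothetical helper)."""
    rng = random.Random(seed)
    return [0 if rng.random() < pi else sample_poisson(lam, rng)
            for _ in range(n)]


# Illustrative values only (assumptions, not taken from the paper).
lam, pi = 2.0, 0.6
counts = sample_zip(10_000, lam, pi)
zero_frac = sum(c == 0 for c in counts) / len(counts)

# A plain Poisson with the same mean, (1 - pi) * lam, predicts fewer zeros.
poisson_zero_prob = math.exp(-(1.0 - pi) * lam)
zip_zero_prob = zip_pmf(0, lam, pi)
```

Under these assumed values the ZIP zero probability, pi + (1 - pi) * exp(-lam), is larger than the zero probability of a plain Poisson with the same mean, which is the kind of mismatch that motivates the zero-inflated models compared in the article.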