Kassahun Wondwosen, Neyens Thomas, Molenberghs Geert, Faes Christel, Verbeke Geert
Department of Epidemiology and Biostatistics, Jimma University, Ethiopia.
Stat Med. 2014 Nov 10;33(25):4402-19. doi: 10.1002/sim.6237. Epub 2014 Jun 23.
Count data are collected repeatedly over time in many applications, such as biology, epidemiology, and public health. Such data are often characterized by the following three features. First, correlation due to the repeated measures is usually accounted for using subject-specific random effects, which are assumed to be normally distributed. Second, the sample variance may exceed the mean, and hence, the theoretical mean-variance relationship is violated, leading to overdispersion. This is usually accommodated through a hierarchical approach that combines a Poisson model with gamma-distributed random effects. Third, an excess of zeros beyond what standard count distributions can predict is often handled by either the hurdle or the zero-inflated model. A zero-inflated model assumes two processes as sources of zeros and combines a count distribution with a discrete point mass as a mixture, while the hurdle model handles zero observations and positive counts separately, using a zero-truncated count distribution for the non-zero state. In practice, however, all three features can appear simultaneously. Hence, a modeling framework that incorporates all three is necessary, and this presents challenges for the data analysis. Such models, when conditionally specified, will naturally have a subject-specific interpretation. However, adopting their purposefully modified marginalized versions leads to a direct marginal or population-averaged interpretation for parameter estimates of covariate effects, which is the primary interest in many applications. In this paper, we present a marginalized hurdle model and a marginalized zero-inflated model for correlated and overdispersed count data with excess zero observations and illustrate them with two case studies.
The first dataset focuses on the Anopheles mosquito density around a hydroelectric dam, while the second concerns adolescents' involvement in work to earn money and support their families or themselves. Sub-models, which result from omitting the zero-inflation and/or overdispersion features, are also considered for comparison. Analysis of the two datasets showed that accounting for the correlation, overdispersion, and excess zeros simultaneously resulted in a better fit to the data and, more importantly, that omission of any of them led to incorrect marginal inference and erroneous conclusions about covariate effects.
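The distinction the abstract draws between the zero-inflated and hurdle formulations can be made concrete with a minimal sketch. It assumes a plain Poisson count distribution with no random effects or overdispersion, so it illustrates only the two zero-handling mechanisms and not the marginalized models of the paper. The function names and the parameters `lam`, `pi`, and `p0` are hypothetical and chosen for illustration.

```python
import math


def poisson_pmf(y, lam):
    """Ordinary Poisson probability P(Y = y) with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson (assumed form): a mixture of a point mass at
    zero with weight pi and a Poisson(lam) with weight 1 - pi, so zeros
    can come from either component."""
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base


def hurdle_pmf(y, lam, p0):
    """Hurdle Poisson (assumed form): zeros occur with probability p0;
    positive counts follow a zero-truncated Poisson(lam)."""
    if y == 0:
        return p0
    return (1.0 - p0) * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


def zip_mean(lam, pi):
    """Marginal mean of the zero-inflated Poisson."""
    return (1.0 - pi) * lam


def hurdle_mean(lam, p0):
    """Marginal mean of the hurdle Poisson."""
    return (1.0 - p0) * lam / (1.0 - math.exp(-lam))
```

In both cases the overall mean is not `lam` itself but a function of `lam` and the zero-process parameter. In this simplified setting, that is one way to see why a regression placed on `lam` does not directly describe covariate effects on the overall mean, which is the gap the marginalized versions are meant to close.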