The Institute for Clinical Research and Learning Health Care, The University of Texas Health Science Center at Houston, Houston, TX, USA.
Trials. 2023 Sep 27;24(1):613. doi: 10.1186/s13063-023-07648-8.
Outcomes commonly used in medical research, such as the number of hospital admissions or emergency department visits, are often non-negative integer counts with zero inflation: the majority of patients have a count of zero. Zero-inflated regression models were devised to analyze this type of data. However, neither the performance of zero-inflated regression models nor the properties of the data best suited to them have been thoroughly investigated.
We conducted a simulation study to evaluate the performance of two generalized linear models, negative binomial and zero-inflated negative binomial, for analyzing zero-inflated count data. Simulation scenarios assumed a randomized controlled trial design and varied the true underlying distribution, sample size, and rate of zero inflation. We compared the models in terms of bias, mean squared error, and coverage. Additionally, we used logistic regression to determine which data properties are most important for predicting the best-fitting model.
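As a rough illustration of the data-generating step described above, the following sketch simulates a zero-inflated negative binomial outcome for a two-arm randomized trial using only the Python standard library. It is not the authors' simulation code. The function names (`draw_poisson`, `draw_zinb`, `simulate_trial`) and all parameter values (control-arm mean, rate ratio, dispersion, zero-inflation probability) are assumptions chosen for illustration; only the total sample size of 800 is taken from the range reported in the abstract.

```python
import math
import random


def draw_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the small means used here; not intended for very large lam.
    """
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def draw_zinb(rng, mean, size, p_zero):
    """Draw one zero-inflated negative binomial count.

    With probability p_zero the count is a structural zero; otherwise it is
    negative binomial with the given mean and dispersion parameter `size`,
    generated as a gamma-Poisson mixture.
    """
    if rng.random() < p_zero:
        return 0
    lam = rng.gammavariate(size, mean / size)  # shape=size, scale=mean/size
    return draw_poisson(rng, lam)


def simulate_trial(n_per_arm, control_mean, rate_ratio, size, p_zero, seed=0):
    """Return a list of (arm, count) pairs; arm 0 = control, arm 1 = treatment."""
    rng = random.Random(seed)
    rows = []
    for arm, mu in ((0, control_mean), (1, control_mean * rate_ratio)):
        for _ in range(n_per_arm):
            rows.append((arm, draw_zinb(rng, mu, size, p_zero)))
    return rows


# Illustrative (assumed) settings: 400 patients per arm, 30% structural zeros.
trial = simulate_trial(n_per_arm=400, control_mean=2.0, rate_ratio=0.7,
                       size=1.0, p_zero=0.3, seed=1)
share_zero = sum(1 for _, y in trial if y == 0) / len(trial)
```

Setting `p_zero=0.0` yields ordinary negative binomial data, so the same function covers both of the underlying distributions that a simulation of this kind would vary.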
First, we found that, regardless of the rate of zero inflation, there was little difference between the conventional negative binomial model and its zero-inflated counterpart in the bias of the marginal treatment group coefficient. Second, even when the outcome was simulated from a zero-inflated distribution, the negative binomial model was favored over its zero-inflated counterpart according to the Akaike Information Criterion. Third, the mean and skewness of the non-zero part of the data were stronger predictors of model preference than the percentage of zero counts. These results were not affected by the sample size, which ranged from 60 to 800.
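The data characteristics named in the third finding can be computed directly from an observed count outcome. The sketch below is a minimal, standard-library illustration. The function name `describe_counts`, the example data, and the choice of the moment-based (population) skewness coefficient are assumptions; the abstract does not state which skewness estimator was used.

```python
def describe_counts(counts):
    """Summarize a count outcome: percentage of zeros, plus the mean and
    skewness of the non-zero part.

    Skewness is the moment coefficient m3 / m2**1.5 computed with population
    (divide-by-n) moments; this estimator is an assumption.
    """
    n = len(counts)
    nonzero = [c for c in counts if c > 0]
    pct_zero = 100.0 * (n - len(nonzero)) / n
    if not nonzero:
        return {"pct_zero": pct_zero, "nonzero_mean": None, "nonzero_skew": None}
    m = len(nonzero)
    mean_nz = sum(nonzero) / m
    m2 = sum((c - mean_nz) ** 2 for c in nonzero) / m
    m3 = sum((c - mean_nz) ** 3 for c in nonzero) / m
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"pct_zero": pct_zero, "nonzero_mean": mean_nz, "nonzero_skew": skew}


# Hypothetical outcome: mostly zeros, with one high-utilization patient.
example = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 9]
summary = describe_counts(example)
```

Inspecting these summaries alongside the proportion of zeros is in line with the recommendation that the zero rate alone should not drive the choice of model.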
We recommend that the rate of zero inflation and overdispersion in the outcome not be the sole or main justification for choosing zero-inflated regression models. Investigators should also consider other data characteristics when choosing a model for count data. In addition, if the performance of the negative binomial (NB) and zero-inflated negative binomial (ZINB) regression models is reasonably comparable even with zero-inflated outcomes, we advocate the use of the NB regression model because its results have a clear and straightforward interpretation.