Division of Health Informatics and Surveillance, Center for Surveillance, Epidemiology and Laboratory Services, Centers for Disease Control and Prevention, 1600 Clifton Rd NE, Mailstop E91, Atlanta, GA 30333.
Centers for Disease Control and Prevention, Atlanta, Georgia.
Prev Chronic Dis. 2014 Mar 27;11:E50; quiz E50. doi: 10.5888/pcd11.130252.
Count data are often collected in chronic disease research, and sometimes these data have a skewed distribution. The number of unhealthy days reported in the Behavioral Risk Factor Surveillance System (BRFSS) is an example of such data: most respondents report zero days. Studies have either categorized the Healthy Days measure or used linear regression models. We used alternative regression models for these count data and examined the effect on statistical inference.
Using responses from participants aged 35 years or older from 12 states that included a homeownership question in their 2009 BRFSS, we compared 5 multivariate regression models (logistic, linear, Poisson, negative binomial, and zero-inflated negative binomial) with respect to 1) how well the modeled data fit the observed data and 2) how model selection affects inferences.
Most respondents (66.8%) reported zero mentally unhealthy days. The distribution was highly skewed (variance = 58.7, mean = 3.3 days). Zero-inflated negative binomial regression provided the best-fitting model, followed by negative binomial regression. A significant independent association between homeownership and number of mentally unhealthy days was not found in the logistic, linear, or Poisson regression model but was found in the negative binomial model. The zero-inflated negative binomial model showed that homeowners were 24% more likely than nonowners to have excess zero mentally unhealthy days (adjusted odds ratio, 1.24; 95% confidence interval, 1.08-1.43), but it did not show an association between homeownership and the number of unhealthy days.
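The reported summary statistics already hint at why the Poisson model fits poorly. The following is a minimal back-of-the-envelope sketch, not the authors' analysis: it uses only the three marginal figures quoted above (mean, variance, share of zeros) and ignores covariates, survey weights, and the 30-day upper bound of the measure. The negative binomial step assumes the common NB2 form, variance = mean + alpha × mean², matched by moments; the variable names are illustrative.

```python
import math

# Summary statistics quoted in the results above.
mean_days = 3.3
variance_days = 58.7
observed_zero_share = 0.668

# A Poisson distribution has variance equal to its mean, so a ratio
# far above 1 signals overdispersion.
dispersion_ratio = variance_days / mean_days

# Share of zeros implied by a Poisson distribution with the same mean.
poisson_zero_share = math.exp(-mean_days)

# Assumption: NB2 parameterization, variance = mean + alpha * mean**2,
# with alpha chosen by matching the reported mean and variance.
alpha = (variance_days - mean_days) / mean_days**2
negbin_zero_share = (1 + alpha * mean_days) ** (-1 / alpha)

print(f"variance / mean ratio:        {dispersion_ratio:.1f}")
print(f"observed share of zeros:      {observed_zero_share:.3f}")
print(f"Poisson-implied zeros:        {poisson_zero_share:.3f}")
print(f"neg. binomial-implied zeros:  {negbin_zero_share:.3f}")
```

Under these assumptions, a Poisson distribution with the observed mean implies only about 4% zeros, and a moment-matched negative binomial implies roughly 57%, still short of the observed 66.8%. That gap is in line with the zero-inflated model fitting best, although a marginal check of this kind cannot replace the covariate-adjusted comparison reported here.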
Our comparison of regression models indicates the importance of examining data distribution and selecting models with appropriate assumptions. Otherwise, statistical inferences might be misleading.