Green James A
School of Allied Health, University of Limerick, Limerick, Ireland.
Physical Activity for Health Research Cluster (Health Research Institute), University of Limerick, Limerick, Ireland.
Health Psychol Behav Med. 2021 May 6;9(1):436-455. doi: 10.1080/21642850.2021.1920416.
Dependent variables in health psychology are often counts, for example, of a behaviour or of the number of engagements with an intervention. These counts can be very strongly skewed, and/or contain large numbers of zeros as well as extreme outliers. Consider the question 'How many cigarettes do you smoke on an average day?' The modal answer may be zero, but responses may range from 0 to 40 or more. The same can be true for minutes of moderate-to-vigorous physical activity: for some people this may be near zero, but it can take extreme values for someone training for a marathon. Typical analytical strategies for these data involve explicit (or implied) transformations (smoker v. non-smoker, log transformations). However, these data types are 'counts' (i.e. non-negative whole numbers) or quasi-counts (time is a ratio variable, but discrete minutes of activity could be analysed as a count), and can be modelled using count distributions, including the Poisson and negative binomial distributions (and their zero-inflated and hurdle extensions, which allow even more zeros). In this tutorial paper I demonstrate (in R, Jamovi, and SPSS) the easy application of these models to health psychology data, and their advantages over alternative ways of analysing this type of data, using two datasets: one with a highly dispersed dependent variable (number of views on YouTube), and another with a large number of zeros (number of days on which symptoms were reported over a month). The negative binomial distribution had the best fit for the overdispersed number of views on YouTube. Negative binomial and zero-inflated negative binomial were both good fits for the symptom data with over-abundant zeros. In both cases, count distributions not only provided a better fit but would also lead to different conclusions compared to the poorly fitting traditional regression/linear models.
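The comparison between Poisson and negative binomial fits described in the abstract can be illustrated with a minimal sketch. It is written in Python using only the standard library, not the R, Jamovi, or SPSS workflows the paper demonstrates. The data in `symptom_days` are made up for illustration and are not the paper's dataset, and the helper names (`poisson_loglik`, `negbin_loglik`, `aic`) are hypothetical. The negative binomial size parameter is estimated by the method of moments rather than by maximum likelihood, so its AIC here is only approximate (and, if anything, slightly pessimistic for the negative binomial).

```python
import math
from statistics import mean, variance


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under a Poisson(lam) model."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, mu, size):
    """Log-likelihood under a negative binomial with mean mu and
    variance mu + mu**2 / size (the common 'NB2' parameterisation)."""
    p = size / (size + mu)
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate better fit."""
    return 2 * n_params - 2 * loglik


# Made-up example: days with symptoms in a month for 20 hypothetical people.
# Many zeros and a long right tail, as the abstract describes.
symptom_days = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 7, 10, 14, 21]

mu = mean(symptom_days)
var = variance(symptom_days)

# A Poisson model assumes variance == mean; a ratio well above 1
# signals overdispersion.
dispersion_ratio = var / mu

# Method-of-moments estimate of the size parameter (assumes var > mu).
size = mu ** 2 / (var - mu)

poisson_aic = aic(poisson_loglik(symptom_days, mu), n_params=1)
negbin_aic = aic(negbin_loglik(symptom_days, mu, size), n_params=2)

print(f"mean = {mu:.2f}, variance = {var:.2f}, ratio = {dispersion_ratio:.2f}")
print(f"Poisson AIC = {poisson_aic:.1f}")
print(f"Negative binomial AIC (moment estimate) = {negbin_aic:.1f}")
```

For data like these, where the variance far exceeds the mean, the negative binomial AIC comes out lower than the Poisson AIC, which is the pattern the abstract reports for its overdispersed outcomes. A real analysis would fit regression models with predictors and maximum likelihood estimation, as the paper does in dedicated statistical software.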