Roosa Kimberlyn, Luo Ruiyan, Chowell Gerardo
Department of Population Health Sciences, School of Public Health, Georgia State University, Atlanta, GA, USA.
Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, Bethesda, MD, USA.
Math Biosci Eng. 2019 May 16;16(5):4299-4313. doi: 10.3934/mbe.2019214.
The Poisson distribution is commonly assumed as the error structure for count data; however, empirical data may exhibit greater variability than expected under a given statistical model. Greater variability could point to model misspecification, such as missing crucial information about the epidemiology of the disease or changes in population behavior. When the mechanism producing the apparent overdispersion is unknown, it is typically assumed that the variance in the data exceeds the mean (by some scaling factor). Thus, a probability distribution that allows for overdispersion (negative binomial, for example) may better represent the data. Here, we use simulation studies to assess how misspecifying the error structure affects parameter estimation results, specifically bias and uncertainty, as a function of the level of random noise in the data. We compare results for two parameter estimation methods: nonlinear least squares and maximum likelihood estimation with Poisson error structure. We analyze two phenomenological models, the generalized growth model and the generalized logistic growth model, to assess how parameter estimation results are affected by the level of overdispersion in the data. We use simulation to obtain confidence intervals and mean squared error of parameter estimates. We also analyze the impact of the amount of data, or ascending phase length, on the results of the generalized growth model for increasing levels of overdispersion. The results show a clear pattern of increasing uncertainty, or confidence interval width, as the overdispersion in the data increases. While maximum likelihood estimation consistently yielded narrower confidence intervals and smaller mean squared error, differences between the two methods were minimal and not practically significant. At moderate levels of overdispersion, both estimation methods yielded similar performance. Importantly, it is shown that issues of parameter uncertainty and bias in the presence of overdispersion can be mitigated with the inclusion of more data.
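The simulation design described above can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions that the abstract does not state: the generalized growth model is taken in its standard form C'(t) = r C(t)^p with closed-form solution for 0 <= p < 1, the true parameter values, initial condition, series length, and variance-to-mean ratios are hypothetical, and a coarse grid search stands in for a proper numerical optimizer. All function names are illustrative. Overdispersed counts are drawn from a negative binomial distribution whose variance equals a chosen multiple of the mean, and each simulated series is fitted by least squares and by Poisson maximum likelihood.

```python
import numpy as np

# Hypothetical "true" parameters and search grids (assumptions, not from the paper).
TRUE_R, TRUE_P = 0.4, 0.8
R_GRID = np.linspace(0.1, 0.8, 71)
P_GRID = np.linspace(0.5, 0.99, 50)   # p < 1 keeps the closed form valid


def ggm_incidence(t, r, p, c0=1.0):
    """Incidence C'(t) = r * C(t)**p of the generalized growth model, 0 <= p < 1."""
    cum = (c0 ** (1.0 - p) + r * (1.0 - p) * t) ** (1.0 / (1.0 - p))
    return r * cum ** p


def simulate_counts(mean, var_to_mean, rng):
    """Draw counts with variance = var_to_mean * mean (Poisson when the ratio is 1)."""
    if var_to_mean <= 1.0:
        return rng.poisson(mean)
    prob = 1.0 / var_to_mean            # negative binomial success probability
    size = mean * prob / (1.0 - prob)   # dispersion chosen so the mean is preserved
    return rng.negative_binomial(size, prob)


def fit_grid(t, y, r_grid, p_grid):
    """Grid-search estimates of (r, p): least squares and Poisson maximum likelihood."""
    R, P = np.meshgrid(r_grid, p_grid, indexing="ij")
    mu = ggm_incidence(t[None, None, :], R[..., None], P[..., None])
    sse = ((y - mu) ** 2).sum(axis=-1)
    nll = (mu - y * np.log(mu)).sum(axis=-1)   # Poisson negative log-likelihood up to a constant
    i_ls = np.unravel_index(np.argmin(sse), sse.shape)
    i_ml = np.unravel_index(np.argmin(nll), nll.shape)
    return (R[i_ls], P[i_ls]), (R[i_ml], P[i_ml])


def run_study(var_to_mean, n_sims=100, n_days=30, seed=0):
    """Return 95% interval widths and MSE, each indexed as [method, parameter]."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_days + 1, dtype=float)
    mean = ggm_incidence(t, TRUE_R, TRUE_P)
    est = np.array([
        fit_grid(t, simulate_counts(mean, var_to_mean, rng), R_GRID, P_GRID)
        for _ in range(n_sims)
    ])                                            # shape (n_sims, 2 methods, 2 parameters)
    lo, hi = np.percentile(est, [2.5, 97.5], axis=0)
    mse = ((est - np.array([TRUE_R, TRUE_P])) ** 2).mean(axis=0)
    return hi - lo, mse


if __name__ == "__main__":
    for ratio in (1.0, 5.0, 20.0):
        width, mse = run_study(ratio)
        print(f"variance/mean = {ratio:g}")
        for name, w, m in zip(("least squares", "Poisson MLE"), width, mse):
            print(f"  {name}: CI width (r, p) = {w.round(3)}, MSE (r, p) = {m}")
```

Changing `n_days` in `run_study` gives a rough way to explore how the length of the ascending phase affects interval width at a fixed level of overdispersion. The intervals here are simple percentile intervals over simulation replicates, which may differ from the procedure used in the paper.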