Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 77843-3136, United States.
Accid Anal Prev. 2010 Mar;42(2):741-9. doi: 10.1016/j.aap.2009.11.002. Epub 2009 Dec 16.
Factors that cause heterogeneity in crash data are often unknown to researchers, and failure to accommodate such heterogeneity in statistical models can undermine the validity of empirical results. A recently proposed finite mixture of negative binomial regression models has shown a potential advantage in addressing unobserved heterogeneity while also providing useful information about features of the population under study. Despite its usefulness, no study has examined the performance of this finite mixture under the range of sample sizes and sample-mean values that are common in crash data analysis. This study investigated the bias associated with the Bayesian summary statistics (posterior mean and median) of the dispersion parameters in the two-component finite mixture of negative binomial regression models. A simulation study was conducted using various sample sizes under different sample-mean values. Two prior specifications (non-informative and weakly-informative) on the dispersion parameter were also compared. The results showed that the posterior mean using the non-informative prior exhibited a high bias for the dispersion parameter and should be avoided when the dataset contains fewer than 2,000 observations (even for high sample-mean values). The posterior median showed much better bias properties, particularly at small sample sizes and small sample means. However, as the sample size increased, the posterior median using the non-informative prior also began to exhibit an upward-bias trend. In such cases, the posterior mean or median with the weakly-informative prior provided smaller bias. Based on the simulation results, guidelines on the choice of priors and summary statistics are presented for different sample sizes and sample-mean values.
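As an illustration of the kind of data-generating step such a simulation study relies on, the sketch below draws counts from a two-component finite mixture of negative binomial regressions and defines bias as the average estimate minus the true value. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the single uniform covariate, the log link, and every parameter value are hypothetical choices made here, and the Bayesian estimation step itself is omitted.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the modest means used in this sketch; not suited to very
    large lam, where exp(-lam) underflows.
    """
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_nb_mixture(n, weight, betas, alphas, rng):
    """Simulate n observations from a two-component NB regression mixture.

    Component k has mean mu_k = exp(b0_k + b1_k * x) and variance
    mu_k + alphas[k] * mu_k**2, so alphas[k] is its dispersion parameter.
    All argument names and values are hypothetical.
    """
    xs, zs, ys = [], [], []
    for _ in range(n):
        x = rng.random()                        # covariate, uniform on [0, 1)
        z = 0 if rng.random() < weight else 1   # latent component label
        b0, b1 = betas[z]
        mu = math.exp(b0 + b1 * x)
        # Gamma-Poisson representation of the negative binomial:
        # lam ~ Gamma(shape=1/alpha, scale=alpha*mu), then y ~ Poisson(lam).
        lam = rng.gammavariate(1.0 / alphas[z], alphas[z] * mu)
        xs.append(x)
        zs.append(z)
        ys.append(draw_poisson(lam, rng))
    return xs, zs, ys


def bias(estimates, true_value):
    """Average of the estimates across replications minus the true value."""
    return sum(estimates) / len(estimates) - true_value


# Hypothetical settings: equal weights, dispersion 0.5 and 2.0.
rng = random.Random(2010)
x, z, y = simulate_nb_mixture(
    n=500, weight=0.5, betas=[(0.0, 0.5), (1.0, 0.5)], alphas=[0.5, 2.0], rng=rng
)
print("sample mean of simulated counts:", sum(y) / len(y))
```

In a full study, each simulated dataset would be fitted with the mixture model under each prior, and `bias` would be applied to the resulting posterior means and medians of the dispersion parameters across replications.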