Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
Psychology and Language Sciences, University College London, London, UK.
BMC Med Res Methodol. 2021 Aug 17;21(1):173. doi: 10.1186/s12874-021-01353-3.
Using auxiliary variables with maximum likelihood parameter estimation in surveys where data are missing by design is not widespread, despite its documented efficiency gains over traditional approaches that use sampling weights. Although efficiency gains from including Normally distributed auxiliary variables in a model have been reported in the literature, little is known about the effect of non-Normal auxiliary variables on parameter estimation.
We simulate growth data to mimic SCALES, a two-stage survey of language development with a screening phase (stage one), for which data are observed for the whole sample, and an intensive assessment phase (stage two), for which data are observed for a sub-sample selected using stratified random sampling. In the simulation, we allow a fully observed Poisson-distributed stratification criterion to be correlated with the partially observed model responses, and we develop five generalised structural equation growth models that incorporate the auxiliary information from this criterion. We compare these models with each other and with a weighted growth model in terms of bias, efficiency, and coverage. Finally, we apply our best-performing model to SCALES data and show how to obtain growth parameters and population norms.
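To make the two-stage design concrete, the following is a minimal Python sketch of the data-generating and sampling steps only, together with the inverse-probability-weighted comparator. It is not the authors' simulation or their generalised structural equation models. The Poisson rate, the linear link between the criterion and the response, the stratum cut-off, and the sampling fractions are all hypothetical values chosen for illustration, and a single cross-sectional response stands in for the growth outcomes.

```python
import numpy as np


def stratified_sample(strata, fractions, rng):
    """Draw a fixed fraction without replacement within each stratum.

    Returns a boolean selection mask and each unit's inclusion probability.
    """
    selected = np.zeros(strata.shape[0], dtype=bool)
    prob = np.zeros(strata.shape[0])
    for stratum, frac in fractions.items():
        idx = np.flatnonzero(strata == stratum)
        if idx.size == 0:
            continue
        k = int(round(frac * idx.size))
        chosen = rng.choice(idx, size=k, replace=False)
        selected[chosen] = True
        prob[idx] = k / idx.size
    return selected, prob


def weighted_mean(y, selected, prob):
    """Inverse-probability-weighted mean of the stage-two responses."""
    w = 1.0 / prob[selected]
    return float(np.sum(w * y[selected]) / np.sum(w))


def demo(n=20000, seed=2021):
    rng = np.random.default_rng(seed)
    # Stage one: fully observed Poisson criterion (hypothetical rate of 4).
    z = rng.poisson(lam=4.0, size=n)
    # Response correlated with the criterion (hypothetical linear relation).
    y = 100.0 - 2.0 * z + rng.normal(0.0, 5.0, size=n)
    # Two strata from a hypothetical cut-off; the high stratum is over-sampled.
    strata = (z >= 6).astype(int)
    selected, prob = stratified_sample(strata, {0: 0.1, 1: 0.8}, rng)
    return {
        "full": float(y.mean()),               # unavailable in practice
        "naive": float(y[selected].mean()),    # ignores the design
        "weighted": weighted_mean(y, selected, prob),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name:>8}: {value:.2f}")
```

Under these assumptions, the unweighted stage-two mean is pulled towards the over-sampled stratum, while the weighted mean stays close to the full-sample value. The auxiliary-variable models described above aim to recover the same target by modelling the criterion jointly with the responses instead of weighting.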
Parameter estimation from a model that incorporates a non-Normal auxiliary variable is unbiased and more efficient than its weighted counterpart. The auxiliary variable method can produce efficient population percentile norms and velocities.
Using a fully observed variable that dominates the selection of the sample and correlates strongly with the incomplete variable of interest appears to benefit the estimation process.