Williams Sharifa Z, Zou Jungang, Liu Yutao, Si Yajuan, Galea Sandro, Chen Qixuan
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, USA.
Edward J. Bloustein School of Planning and Public Policy, Rutgers University, New Brunswick, New Jersey, USA.
Stat Med. 2024 Dec 30;43(30):5803-5813. doi: 10.1002/sim.10270. Epub 2024 Nov 18.
Probability surveys are challenged by increasing nonresponse rates, which can bias statistical inference. Auxiliary information about the population can be used to reduce this bias in estimation. Continuous auxiliary variables in administrative records are often discretized before release to the public to avoid confidentiality breaches. This may weaken the utility of the administrative records in improving survey estimates, particularly when there is a strong relationship between the continuous auxiliary information and the survey outcome. In this paper, we propose a two-step strategy. In the first step, statistical agencies use the confidential continuous auxiliary data in the population to estimate the response propensity score of the survey sample, which is then included in a modified population dataset released to data users. In the second step, data users, who do not have access to the confidential continuous auxiliary data, conduct predictive survey inference by including the discretized continuous variables and the propensity score as predictors, using splines, in a Bayesian model. We show by simulation that the proposed method performs well, yielding more efficient estimates of population means, with 95% credible intervals providing better coverage, than alternative approaches. We illustrate the proposed method using data from the Ohio Army National Guard Mental Health Initiative (OHARNG-MHI). The methods developed in this work are available in the R package AuxSurvey.
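The two-step strategy in the abstract can be illustrated with a small simulation. The sketch below is a simplified stand-in, not the authors' method and not the AuxSurvey API: the data-generating model, the function names (`fit_logistic`, `fit_line`, `simulate_and_estimate`), and the four-category discretization are all hypothetical, and the Bayesian spline model is replaced by a separate least-squares line in the propensity score within each released category. It is meant only to show how a released propensity score can carry information that the discretized variable alone loses.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, r, n_iter=25):
    """Newton-Raphson fit of P(r = 1 | x) = expit(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += ri - p            # score (gradient) terms
            g1 += (ri - p) * xi
            h00 += w                # information matrix terms
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def fit_line(p, y):
    """Least-squares intercept and slope of y on p."""
    n = len(p)
    pm, ym = sum(p) / n, sum(y) / n
    sxx = sum((pi - pm) ** 2 for pi in p)
    sxy = sum((pi - pm) * (yi - ym) for pi, yi in zip(p, y))
    slope = sxy / sxx
    return ym - slope * pm, slope


def simulate_and_estimate(n=5000, seed=1):
    rng = random.Random(seed)
    # Hypothetical population: x is the confidential continuous auxiliary
    # variable, y the survey outcome, r the response indicator.
    x = [rng.random() for _ in range(n)]
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
    r = [1 if rng.random() < expit(-1.0 + 2.0 * xi) else 0 for xi in x]

    # Step 1 (agency): estimate response propensity from the continuous x,
    # then release only the discretized x and the propensity score.
    b0, b1 = fit_logistic(x, r)
    score = [expit(b0 + b1 * xi) for xi in x]
    cat = [min(int(xi * 4), 3) for xi in x]

    # Step 2 (data user): model y among respondents using the released
    # category and score (simplification: one line per category, in place
    # of the Bayesian spline model), then predict y for nonrespondents.
    lines = {}
    for c in range(4):
        idx = [i for i in range(n) if r[i] == 1 and cat[i] == c]
        lines[c] = fit_line([score[i] for i in idx], [y[i] for i in idx])

    total = 0.0
    for i in range(n):
        if r[i] == 1:
            total += y[i]
        else:
            a, b = lines[cat[i]]
            total += a + b * score[i]

    predictive = total / n
    naive = sum(y[i] for i in range(n) if r[i] == 1) / sum(r)
    truth = sum(y) / n
    return truth, naive, predictive


if __name__ == "__main__":
    truth, naive, predictive = simulate_and_estimate()
    print("population mean:", truth)
    print("respondent mean:", naive)
    print("predictive estimate:", predictive)
```

In this toy setup, units with larger `x` respond more often and have larger outcomes, so the unadjusted respondent mean overstates the population mean, while the prediction-based estimate corrects for this using only the released category and score. A point estimate is all this sketch produces; the credible intervals discussed in the abstract would require a full Bayesian fit.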