Chen Qixuan, Elliott Michael R, Little Roderick J A
Department of Biostatistics, Columbia University Mailman School of Public Health, 722 West 168 Street, New York, NY 10032.
Department of Biostatistics, University of Michigan School of Public Health, 1420 Washington Heights, Ann Arbor, MI 48109.
Surv Methodol. 2012 Dec;38(2):203-214. Epub 2012 Dec 19.
This paper develops two Bayesian methods for inference about finite population quantiles of continuous survey variables from unequal probability sampling. The first method estimates the cumulative distribution function of the continuous survey variable by fitting a number of probit penalized spline regression models on the inclusion probabilities. The finite population quantiles are then obtained by inverting the estimated distribution function. This method is quite computationally demanding. The second method predicts non-sampled values by assuming a smoothly varying relationship between the continuous survey variable and the probability of inclusion, modeling both the mean function and the variance function using splines. The two Bayesian spline-model-based estimators yield a desirable balance between robustness and efficiency. Simulation studies show that both methods yield smaller root mean squared errors than the sample-weighted estimator and the ratio and difference estimators described by Rao, Kovar, and Mantel (1990), and are more robust to model misspecification than the regression-through-the-origin model-based estimator described in Chambers and Dunstan (1986). When the sample size is small, the 95% credible intervals of the two new methods have coverage closer to the nominal level than the intervals of the sample-weighted estimator.
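To make the idea of obtaining a quantile by inverting an estimated distribution function concrete, the following is a minimal sketch of the sample-weighted baseline that the abstract compares against. It is not the authors' Bayesian spline method. It assumes a Hajek-type weighted empirical distribution function with weights equal to the inverse inclusion probabilities, and the function name `weighted_quantile` and its arguments are hypothetical names chosen for illustration.

```python
def weighted_quantile(y, pi, alpha):
    """Sample-weighted alpha-quantile obtained by inverting a weighted CDF.

    y     : sampled values of the continuous survey variable
    pi    : inclusion probabilities of the sampled units, each in (0, 1]
    alpha : quantile level, strictly between 0 and 1

    Assumption (not from the paper): the CDF estimate is the normalized
    sum of inverse-probability weights of units with value <= t.
    """
    if len(y) == 0 or len(y) != len(pi):
        raise ValueError("y and pi must be non-empty and of equal length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if any(p <= 0.0 or p > 1.0 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")

    # Each sampled unit represents 1 / pi units of the finite population.
    weights = [1.0 / p for p in pi]
    total = sum(weights)

    # Walk through the sample in increasing order of y, accumulating weight.
    cumulative = 0.0
    ordered = sorted(zip(y, weights))
    for value, weight in ordered:
        cumulative += weight
        # The quantile is the smallest value whose estimated CDF reaches alpha.
        if cumulative / total >= alpha:
            return value

    # Guard against floating-point round-off in the final comparison.
    return ordered[-1][0]


# Example: the unit with y = 10 has a small inclusion probability, so it
# carries most of the weight and pulls the estimated median down to 10.
median = weighted_quantile([10.0, 20.0, 30.0], [0.1, 0.5, 0.5], 0.5)
```

A model-based approach such as the two described above would replace the weighted empirical distribution function with a smoothed, model-predicted one before inverting, which is where the reported efficiency gains in small samples would come from.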