Wang Lingxiao
Department of Statistics, University of Virginia, Charlottesville, VA 22903, United States.
National Cancer Institute, Division of Cancer Epidemiology & Genetics, Biostatistics Branch, Rockville, MD 20850, United States.
Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf092.
Two-phase sampling designs are frequently applied in epidemiological studies and large-scale health surveys. In such designs, certain variables are collected exclusively within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators in regression analysis. However, few studies provide valid finite-population inference for calibration estimators that use appropriate calibration auxiliary variables while simultaneously accounting for the complex sample designs of the first- and second-phase samples. Moreover, no existing work considers the "pooled design," in which some covariates are measured exclusively in certain repeated survey cycles. This paper proposes calibrating the sample weights for the second-phase sample to the weighted first-phase sample based on score functions of the regression model that uses predictions of the second-phase variable for the first-phase sample. We establish the consistency of estimation using calibrated weights and provide variance estimation for the regression coefficients under the two-phase design or the pooled design nested within complex survey designs. Empirical evidence highlights the efficiency and robustness of the proposed calibration compared to existing calibration and imputation methods. Data examples from the National Health and Nutrition Examination Survey are provided.
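The calibration idea described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's implementation. It assumes a linear outcome model, a linear prediction model for the second-phase variable, and linear (GREG-type) calibration, none of which the abstract specifies. All data are simulated, and every name (`wls`, `linear_calibrate`, `v`, `z`, `w1`, `pi2`) is hypothetical. Variance estimation, which the paper also addresses, is omitted.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares coefficients for design matrix X and weights w."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

def linear_calibrate(w, A, totals):
    """Linear calibration: adjust w so that A.T @ w_cal equals the target totals."""
    AtW = A.T * w
    lam = np.linalg.solve(AtW @ A, totals - A.T @ w)
    return w * (1.0 + A @ lam)

# Simulated first-phase sample (assumption: all values below are made up).
rng = np.random.default_rng(0)
n1 = 2000
x = rng.normal(size=n1)                  # covariate observed in phase 1
v = 0.8 * x + rng.normal(size=n1)        # costly variable, used only for phase 2 units
z = v + rng.normal(scale=0.5, size=n1)   # cheap proxy observed in phase 1
y = 1.0 + 0.5 * x + 1.5 * v + rng.normal(size=n1)
w1 = rng.uniform(50, 150, size=n1)       # first-phase sample weights

# Second-phase subsample with a constant selection probability (assumption).
pi2 = 0.25
in2 = rng.random(n1) < pi2
n2 = int(in2.sum())
w2 = w1[in2] / pi2                       # uncalibrated second-phase weights

# Step 1: predict the second-phase variable for every first-phase unit.
P1 = np.column_stack([np.ones(n1), x, z])
gamma = wls(P1[in2], v[in2], w2)
v_hat = P1 @ gamma

# Step 2: fit the outcome model on phase 1 with predictions, then form unit scores.
D1 = np.column_stack([np.ones(n1), x, v_hat])
beta_tilde = wls(D1, y, w1)
scores = D1 * (y - D1 @ beta_tilde)[:, None]

# Step 3: calibrate phase-2 weights to the weighted phase-1 totals of (1, scores).
A1 = np.column_stack([np.ones(n1), scores])
totals = A1.T @ w1
w_cal = linear_calibrate(w2, A1[in2], totals)

# Step 4: fit the target regression on phase 2 using the observed variable.
D2 = np.column_stack([np.ones(n2), x[in2], v[in2]])
beta_cal = wls(D2, y[in2], w_cal)        # calibrated-weight estimate
beta_unc = wls(D2, y[in2], w2)           # uncalibrated estimate, for comparison

print("calibrated:  ", beta_cal)
print("uncalibrated:", beta_unc)
```

Linear calibration is used here only because it has a closed-form solution. It can produce negative weights, so raking or another bounded distance function may be preferable in practice.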