Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD 20850, USA.
Biostatistics. 2022 Jul 18;23(3):875-890. doi: 10.1093/biostatistics/kxaa060.
When validating a risk model in an independent cohort, some predictors may be missing for some subjects. Missingness can be unplanned or by design, as in case-cohort or nested case-control studies, in which some covariates are measured only in subsampled subjects. Weighting methods and imputation are used to handle missing data. We propose methods to increase the efficiency of weighting to assess calibration of a risk model (i.e. bias in model predictions), which is quantified by the ratio of the number of observed events, $\mathcal{O}$, to expected events, $\mathcal{E}$, computed from the model. We adjust known inverse probability weights by incorporating auxiliary information available for all cohort members. We use survey calibration that requires the weighted sum of the auxiliary statistics in the complete data subset to equal their sum in the full cohort. We show that a pseudo-risk estimate that approximates the actual risk value but uses only variables available for the entire cohort is an excellent auxiliary statistic to estimate $\mathcal{E}$. We derive analytic variance formulas for $\mathcal{O}/\mathcal{E}$ with adjusted weights. In simulations, weight adjustment with pseudo-risk was much more efficient than inverse probability weighting and yielded consistent estimates even when the pseudo-risk was a poor approximation. Multiple imputation was often efficient but yielded biased estimates when the imputation model was misspecified. Using these methods, we assessed calibration of an absolute risk model for second primary thyroid cancer in an independent cohort.
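The weight adjustment described above can be illustrated with a minimal sketch. It assumes linear (GREG-type) calibration, which is one common form of survey calibration and not necessarily the variant used in the paper. The simulated cohort, the sampling probabilities, the logistic risk coefficients, and all names (`calibrate_weights`, `pseudo`, `E_cal`, and so on) are hypothetical and chosen only for illustration. The auxiliary statistics are an intercept and a pseudo-risk computed from a predictor available for every cohort member.

```python
import numpy as np

def calibrate_weights(d, A_sub, totals):
    """Linear calibration (an assumed variant): return w = d * (1 + A_sub @ lam),
    with lam chosen so that A_sub.T @ w equals the full-cohort totals."""
    M = A_sub.T @ (d[:, None] * A_sub)              # weighted cross-product matrix
    lam = np.linalg.solve(M, totals - A_sub.T @ d)  # solve the calibration equations
    return d * (1.0 + A_sub @ lam)

# Hypothetical simulated cohort (illustration only, not the paper's data).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)   # predictor available for all cohort members
z = rng.normal(size=n)   # predictor measured only in the subsample
risk = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * x + 0.5 * z)))  # model risk, needs z
pseudo = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * x)))          # pseudo-risk, uses x only
events = rng.binomial(1, risk)

# Case-cohort-like design: every case is sampled, non-cases with probability 0.2.
p = np.where(events == 1, 1.0, 0.2)
sampled = rng.random(n) < p
d = 1.0 / p[sampled]     # known inverse probability weights

# Auxiliary statistics: intercept and pseudo-risk, known for the full cohort.
A = np.column_stack([np.ones(n), pseudo])
w = calibrate_weights(d, A[sampled], A.sum(axis=0))

O = events.sum()                     # observed events in the full cohort
E_ipw = np.sum(d * risk[sampled])    # expected events, inverse probability weights
E_cal = np.sum(w * risk[sampled])    # expected events, calibrated weights
print("O/E (IPW):       ", O / E_ipw)
print("O/E (calibrated):", O / E_cal)
```

By construction, the calibrated weights reproduce the full-cohort totals of the auxiliary statistics exactly, so the weighted pseudo-risk sum in the subsample matches its cohort sum. The sketch does not implement the analytic variance formulas for $\mathcal{O}/\mathcal{E}$.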