Model misspecification and robust analysis for outcome-dependent sampling designs under generalized linear models.

Author affiliations

Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

Publication information

Stat Med. 2023 Apr 30;42(9):1338-1352. doi: 10.1002/sim.9673. Epub 2023 Feb 9.

DOI:10.1002/sim.9673

PMID:36757145

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10883476/

Abstract

Outcome-dependent sampling (ODS) is a commonly used class of sampling designs to increase estimation efficiency in settings where response information (and possibly adjuster covariates) is available, but the exposure is expensive and/or cumbersome to collect. We focus on ODS within the context of a two-phase study, where in Phase One the response and adjuster covariate information is collected on a large cohort that is representative of the target population, but the expensive exposure variable is not yet measured. In Phase Two, using response information from Phase One, we selectively oversample a subset of informative subjects in whom we collect expensive exposure information. Importantly, the Phase Two sample is no longer representative, and we must use ascertainment-correcting analysis procedures for valid inferences. In this paper, we focus on likelihood-based analysis procedures, particularly a conditional-likelihood approach and a full-likelihood approach. Whereas the full-likelihood retains incomplete Phase One data for subjects not selected into Phase Two, the conditional-likelihood explicitly conditions on Phase Two sample selection (ie, it is a "complete case" analysis procedure). These designs and analysis procedures are typically implemented assuming a known, parametric model for the response distribution. However, in this paper, we approach analyses implementing a novel semi-parametric extension to generalized linear models (SPGLM) to develop likelihood-based procedures with improved robustness to misspecification of distributional assumptions. We specifically focus on the common setting where standard GLM distributional assumptions are not satisfied (eg, misspecified mean/variance relationship). We aim to provide practical design guidance and flexible tools for practitioners in these settings.
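
To make the ascertainment correction concrete, the following is a minimal sketch of a conditional ("complete case") log-likelihood under an outcome-dependent Phase Two design. It is not the SPGLM procedure described in the abstract: it swaps in a fully parametric normal linear model, Y | x ~ N(b0 + b1·x, sigma²), purely for illustration. The three outcome strata, the cutoffs, the stratum sampling probabilities, and every function name are hypothetical choices, not taken from the paper.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def stratum(y, cutoffs):
    """Outcome stratum index: 0 = low tail, 1 = middle, 2 = high tail."""
    lo, hi = cutoffs
    return 0 if y < lo else (1 if y < hi else 2)


def phase_two_sample(cohort, cutoffs, probs, rng):
    """Select Phase Two subjects with a probability that depends only on
    the outcome stratum (assumed design; exposure x is observed only here)."""
    return [(y, x) for (y, x) in cohort if rng.random() < probs[stratum(y, cutoffs)]]


def selection_prob_given_x(mu, sigma, cutoffs, probs):
    """P(selected | x) = sum_k probs[k] * P(Y in stratum k | x)."""
    lo, hi = cutoffs
    p_lo = normal_cdf((lo - mu) / sigma)
    p_hi = 1.0 - normal_cdf((hi - mu) / sigma)
    p_mid = 1.0 - p_lo - p_hi
    return probs[0] * p_lo + probs[1] * p_mid + probs[2] * p_hi


def conditional_loglik(theta, sample, cutoffs, probs):
    """Ascertainment-corrected log-likelihood over Phase Two subjects:
    log f(y | x) + log pi(y) - log P(selected | x), summed over the sample."""
    b0, b1, sigma = theta
    total = 0.0
    for y, x in sample:
        mu = b0 + b1 * x
        total += (
            normal_logpdf(y, mu, sigma)
            + math.log(probs[stratum(y, cutoffs)])
            - math.log(selection_prob_given_x(mu, sigma, cutoffs, probs))
        )
    return total


if __name__ == "__main__":
    rng = random.Random(1)
    # Phase One: simulate a cohort of (outcome, exposure) pairs.
    cohort = []
    for _ in range(2000):
        x = rng.gauss(0.0, 1.0)
        cohort.append((0.5 + 1.0 * x + rng.gauss(0.0, 1.0), x))
    cutoffs = (-1.0, 2.0)
    probs = (0.9, 0.1, 0.9)  # oversample the outcome tails
    sample = phase_two_sample(cohort, cutoffs, probs, rng)
    print(len(sample), conditional_loglik((0.5, 1.0, 1.0), sample, cutoffs, probs))
```

When all stratum probabilities are equal, the correction terms cancel and the function reduces to the ordinary normal log-likelihood, which is a convenient sanity check. A full-likelihood analysis would additionally retain the outcome-only contributions from Phase One subjects not selected into Phase Two; that is not shown here.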
