ICES, Toronto, Ontario, Canada.
Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
Stat Med. 2023 May 10;42(10):1525-1541. doi: 10.1002/sim.9685. Epub 2023 Feb 19.
We examined the setting in which a variable that is subject to missingness is used both as an inclusion/exclusion criterion for creating the analytic sample and subsequently as the primary exposure in the analysis model of scientific interest. An example is cancer stage: patients with stage IV cancer are often excluded from the analytic sample, and cancer stage (I to III) is an exposure variable in the analysis model. We considered two analytic strategies. The first strategy, referred to as "exclude-then-impute," excludes subjects whose observed value of the target variable equals the specified exclusion value and then uses multiple imputation to complete the data in the resultant sample. The second strategy, referred to as "impute-then-exclude," first uses multiple imputation to complete the data and then excludes subjects based on the observed or filled-in values in the completed samples. Monte Carlo simulations were used to compare five methods (one based on "exclude-then-impute" and four based on "impute-then-exclude") along with a complete case analysis. We considered two missing-data mechanisms: missing completely at random and missing at random. We found that an impute-then-exclude strategy using substantive-model-compatible fully conditional specification tended to have superior performance across 72 different scenarios. We illustrated the application of these methods using empirical data on patients hospitalized with heart failure, where heart failure subtype was used for cohort creation (excluding subjects with heart failure with preserved ejection fraction) and was also an exposure in the analysis model.
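The difference between the two strategies is mainly one of ordering, which a minimal sketch can make concrete. The code below is an illustration only, not the authors' implementation: the function names, the toy `stages` data, and the imputation step are all assumptions. In particular, `impute_once` is a crude hot-deck draw from the observed values, standing in for a real imputation model (such as substantive-model-compatible fully conditional specification) that would condition on the outcome and covariates.

```python
import random

def impute_once(stages, rng):
    """Fill each missing value (None) with a random draw from the observed values.

    Hypothetical stand-in for a proper imputation model; it ignores the
    outcome and all covariates.
    """
    observed = [s for s in stages if s is not None]
    return [s if s is not None else rng.choice(observed) for s in stages]

def exclude_then_impute(stages, excluded, m, rng):
    """Drop subjects whose OBSERVED value equals `excluded`, then impute m times."""
    # Subjects with a missing value are kept, because None != excluded.
    kept = [s for s in stages if s != excluded]
    return [impute_once(kept, rng) for _ in range(m)]

def impute_then_exclude(stages, excluded, m, rng):
    """Impute m times on the full sample, then drop observed or imputed `excluded`."""
    completed = [impute_once(stages, rng) for _ in range(m)]
    return [[s for s in data if s != excluded] for data in completed]

rng = random.Random(0)
# Toy cancer-stage data: 1 to 4 are stages I to IV, None marks a missing stage.
stages = [1, 2, 3, 4, None, 2, 4, None, 1, 3, None, 4]

eti = exclude_then_impute(stages, excluded=4, m=5, rng=rng)
ite = impute_then_exclude(stages, excluded=4, m=5, rng=rng)

print("exclude-then-impute sample sizes:", [len(d) for d in eti])
print("impute-then-exclude sample sizes:", [len(d) for d in ite])
```

In this sketch, exclude-then-impute always yields completed samples of the same size, because subjects with a missing stage can never be excluded and are only ever assigned stages I to III. Under impute-then-exclude, the size of the analytic sample can vary across imputations, since a subject with a missing stage may be imputed as stage IV and then dropped.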