Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA.
Pharmacoepidemiol Drug Saf. 2012 May;21 Suppl 2(0 2):13-20. doi: 10.1002/pds.3248.
Electronic healthcare databases are commonly used in comparative effectiveness and safety research of therapeutics. Many databases now include additional confounder information in a subset of the study population through data linkage or data collection. We described and compared existing methods for analyzing such datasets.
Using data from The Health Improvement Network and the relation between non-steroidal anti-inflammatory drugs and upper gastrointestinal bleeding as an example, we employed several methods to handle partially missing confounder information.
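Three of the simpler strategies compared here (complete-case analysis, the missing-category approach, and single imputation by the most common category) can be illustrated on a toy dataset. The following is a minimal sketch, not the authors' code: the records, the variable name `smoking`, and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing smoking status.
records = [
    {"id": 1, "smoking": "never"},
    {"id": 2, "smoking": None},
    {"id": 3, "smoking": "current"},
    {"id": 4, "smoking": "never"},
    {"id": 5, "smoking": None},
]

def complete_case(rows, var):
    """Unweighted complete-case analysis: keep only rows where `var` is observed."""
    return [r for r in rows if r[var] is not None]

def missing_category(rows, var, label="missing"):
    """Missing-category approach: recode missing values as their own level."""
    return [{**r, var: label if r[var] is None else r[var]} for r in rows]

def mode_imputation(rows, var):
    """Single imputation: fill missing values with the most common observed level."""
    observed = [r[var] for r in rows if r[var] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, var: mode if r[var] is None else r[var]} for r in rows]

cc = complete_case(records, "smoking")
mc = missing_category(records, "smoking")
mi = mode_imputation(records, "smoking")
```

Complete-case analysis shrinks the dataset, whereas the other two approaches keep every patient but alter the confounder values that enter the outcome model.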
The crude odds ratio (OR) of upper gastrointestinal bleeding was 1.50 (95% confidence interval: 0.98, 2.28) among selective cyclo-oxygenase-2 inhibitor initiators (n = 43 569) compared with traditional non-steroidal anti-inflammatory drug initiators (n = 411 616). The OR dropped to 0.81 (0.52, 1.27) upon adjustment for confounders recorded for all patients. When further considering three additional variables missing in 22% of the study population (smoking, alcohol consumption, body mass index), the OR was between 0.80 and 0.83 for the missing-category approach, the missing-indicator approach, single imputation by the most common category, multiple imputation by chained equations, and propensity score calibration. The OR was 0.65 (0.39, 1.09) and 0.67 (0.38, 1.16) for the unweighted and the inverse probability weighted complete-case analysis, respectively.
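For readers unfamiliar with how a crude OR and its confidence interval are obtained from a 2×2 table, the following is a minimal sketch using the standard Wald interval on the log-odds scale. The abstract does not report event counts or state which interval method was used, so the function name and the counts below are hypothetical and do not reproduce the study's estimates.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only.
or_hat, ci_low, ci_high = odds_ratio_wald_ci(a=20, b=80, c=10, d=90)
```

An interval that includes 1, as in each estimate reported above, means the data are compatible with no association at the chosen confidence level.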
Existing methods for handling partially missing confounder data require different assumptions and may produce different results. The unweighted complete-case analysis, the missing-category/indicator approach, and single imputation require assumptions that are often unrealistic and should be avoided. In this study, differences across methods were not substantial, likely because of the relatively low proportion of missingness and the weak confounding effect of the three additional variables after adjustment for other variables.
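The inverse probability weighted complete-case analysis mentioned in the results weights each complete case by the inverse of its estimated probability of being complete. The following is a minimal sketch under a simplifying assumption: the probability of completeness is estimated as the observed proportion within strata of one fully recorded covariate, rather than from a fitted regression model. The variable names `age_group` and `bmi` and the toy rows are hypothetical.

```python
from collections import defaultdict

def ipw_complete_cases(rows, stratum_var, partial_var):
    """Return complete cases with weights 1 / P(complete | stratum)."""
    totals = defaultdict(int)
    completes = defaultdict(int)
    for r in rows:
        totals[r[stratum_var]] += 1
        if r[partial_var] is not None:
            completes[r[stratum_var]] += 1

    weighted = []
    for r in rows:
        if r[partial_var] is not None:
            # Estimated probability that a patient in this stratum is complete.
            p_complete = completes[r[stratum_var]] / totals[r[stratum_var]]
            weighted.append({**r, "weight": 1 / p_complete})
    return weighted

# Hypothetical toy data; None marks a missing body mass index.
rows = [
    {"age_group": "<65", "bmi": 24.0},
    {"age_group": "<65", "bmi": None},
    {"age_group": "65+", "bmi": 30.0},
    {"age_group": "65+", "bmi": 28.0},
    {"age_group": "65+", "bmi": None},
]
weighted_rows = ipw_complete_cases(rows, "age_group", "bmi")
total_weight = sum(r["weight"] for r in weighted_rows)
```

When every stratum contains at least one complete case, the weights sum to the full sample size, so the weighted complete cases stand in for the whole study population.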