Gutman Roee, Afendulis Christopher C, Zaslavsky Alan M
Department of Biostatistics, Brown University, Providence, RI 02912.
J Am Stat Assoc. 2013 Jan 1;108(501):34-47. doi: 10.1080/01621459.2012.726889.
End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled efficient calculations despite the large dataset (≈1.7 million cases). The procedure generates datasets in which the matches between the two files are imputed. The datasets can be analyzed independently and results combined using Rubin's multiple imputation rules. Our approach can be applied in other file linking applications.
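The permutation-sampling step and the final combining step can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several simplifying assumptions: the model parameters are held fixed, each cell has the same number of records from both files, and `loglik[i][j]` is a hypothetical precomputed log-likelihood of linking record `i` of one file to record `j` of the other. The function names, the `max_exact` threshold, and the pairwise-swap proposal are all illustrative choices, not details taken from the paper.

```python
import itertools
import math
import random


def sample_permutation_exact(loglik, rng):
    """Draw a linking permutation by enumerating all n! candidates (small cells)."""
    n = len(loglik)
    perms = list(itertools.permutations(range(n)))
    logw = [sum(loglik[i][p[i]] for i in range(n)) for p in perms]
    top = max(logw)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(w - top) for w in logw]
    return rng.choices(perms, weights=weights, k=1)[0]


def sample_permutation_metropolis(loglik, rng, n_steps=1000):
    """Draw a linking permutation with a Metropolis sampler (large cells).

    Assumed proposal: swap the partners of two randomly chosen records.
    The proposal is symmetric, so a move is accepted with probability
    min(1, exp(delta)), where delta is the change in log-likelihood.
    """
    n = len(loglik)
    perm = list(range(n))
    if n < 2:
        return tuple(perm)
    for _ in range(n_steps):
        i, j = rng.sample(range(n), 2)
        delta = (loglik[i][perm[j]] + loglik[j][perm[i]]
                 - loglik[i][perm[i]] - loglik[j][perm[j]])
        if delta >= 0 or rng.random() < math.exp(delta):
            perm[i], perm[j] = perm[j], perm[i]
    return tuple(perm)


def sample_cell(loglik, rng, max_exact=6, n_steps=1000):
    """Use exact enumeration for small cells and Metropolis otherwise."""
    if len(loglik) <= max_exact:
        return sample_permutation_exact(loglik, rng)
    return sample_permutation_metropolis(loglik, rng, n_steps)


def rubin_combine(estimates, variances):
    """Combine m >= 2 completed-data analyses with Rubin's rules.

    Returns the pooled estimate and its total variance,
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    w_bar = sum(variances) / m
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    return q_bar, w_bar + (1 + 1 / m) * b


if __name__ == "__main__":
    rng = random.Random(2013)
    # Hypothetical 3-record cell: higher values mean a more plausible link.
    cell = [[0.0, -2.0, -3.0], [-2.0, 0.0, -1.0], [-3.0, -1.0, 0.0]]
    print(sample_cell(cell, rng))
    print(rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

In a full sampler, draws of this kind would alternate with draws of the model parameters, and each retained set of permutations would yield one imputed linked dataset to be analyzed and then pooled with `rubin_combine`.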