Resnick Dean M, Cox Christine S, Mirel Lisa B
NORC at the University of Chicago, Bethesda, Maryland.
Centers for Disease Control and Prevention, National Center for Health Statistics, Hyattsville, Maryland.
Health Serv Outcomes Res Methodol. 2021 Feb 3;21:389-406. doi: 10.1007/s10742-021-00241-z.
While record linkage can expand the analyses that survey microdata support, it also increases the risk of disclosures that compromise privacy. One way to mitigate this risk is to replace some of the information added through linkage with synthetic data elements. This paper describes a case study using the National Hospital Care Survey (NHCS), which collects patient records from a sample of U.S. hospitals for statistical analysis under a pledge to protect patient privacy. The NHCS data were linked to the National Death Index (NDI) to enhance the survey with mortality information. The information added through NDI linkage enables survival analyses related to hospitalization, but because it includes dates of death and detailed causes of death, joining it with the patient records increases the risk of patient re-identification (albeit only for deceased persons). For this reason, an approach to developing synthetic data was tested: models from survival analysis replace vital status and actual dates of death with synthetic values, and classification tree analysis replaces actual causes of death with synthesized causes. The degree to which analyses of the synthetic data replicate results from the actual data is measured by comparing survival analysis parameter estimates from the two data files. Because synthetic data have value only to the degree that they yield statistical estimates like those based on the actual data, this evaluation is an essential first step in assessing the potential utility of synthetic mortality data.
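The general shape of the approach can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the survival model, so a constant-hazard (exponential) model is assumed here. Sampling causes of death from the empirical distribution within a stratum stands in for the classification tree, whose leaves play a similar role. All field names (`age_group`, `discharge`, `days_observed`, and so on), the one-year follow-up window, and the toy records are hypothetical and invented for illustration only.

```python
import random
from collections import Counter, defaultdict
from datetime import date, timedelta

def fit_exponential_hazard(records):
    """MLE of a constant daily hazard: deaths / total person-days at risk."""
    deaths = sum(1 for r in records if r["died"])
    exposure = sum(r["days_observed"] for r in records)
    return deaths / exposure

def build_cause_table(records):
    """Empirical cause-of-death counts per stratum, plus a pooled fallback."""
    counts = defaultdict(Counter)
    for r in records:
        if r["died"]:
            counts[r["age_group"]][r["cause"]] += 1
            counts["__all__"][r["cause"]] += 1
    return {g: (list(c.keys()), list(c.values())) for g, c in counts.items()}

def synthesize_mortality(record, hazard, followup_days, cause_table, rng):
    """Replace vital status, death date, and cause with synthetic values."""
    synthetic_days = rng.expovariate(hazard)  # draw from the fitted model
    out = {"age_group": record["age_group"], "discharge": record["discharge"]}
    if synthetic_days > followup_days:
        # Synthetic survival time exceeds follow-up: treat as alive (censored).
        out.update(died=False, days_observed=followup_days,
                   death_date=None, cause=None)
    else:
        # Use the pooled table when the stratum had no observed deaths.
        causes, weights = cause_table.get(record["age_group"],
                                          cause_table["__all__"])
        out.update(
            died=True,
            days_observed=synthetic_days,
            death_date=record["discharge"] + timedelta(days=int(synthetic_days)),
            cause=rng.choices(causes, weights=weights, k=1)[0],
        )
    return out

# Toy records (invented, not NHCS or NDI data).
records = [
    {"age_group": "65+", "discharge": date(2016, 1, 10), "died": True,
     "days_observed": 40, "cause": "heart disease"},
    {"age_group": "65+", "discharge": date(2016, 2, 1), "died": True,
     "days_observed": 200, "cause": "cancer"},
    {"age_group": "65+", "discharge": date(2016, 3, 5), "died": False,
     "days_observed": 365, "cause": None},
    {"age_group": "<65", "discharge": date(2016, 1, 20), "died": True,
     "days_observed": 125, "cause": "injury"},
    {"age_group": "<65", "discharge": date(2016, 4, 2), "died": False,
     "days_observed": 365, "cause": None},
    {"age_group": "<65", "discharge": date(2016, 5, 9), "died": False,
     "days_observed": 365, "cause": None},
]

FOLLOWUP_DAYS = 365  # assumed follow-up window
rng = random.Random(0)  # seeded for reproducibility

hazard = fit_exponential_hazard(records)
cause_table = build_cause_table(records)
synthetic = [
    synthesize_mortality(r, hazard, FOLLOWUP_DAYS, cause_table, rng)
    for r in records
]

# Utility check in the spirit of the abstract: refit the same model on the
# synthetic file and compare the parameter estimate with the original.
synthetic_hazard = fit_exponential_hazard(synthetic)
print(f"actual hazard: {hazard:.5f}, synthetic hazard: {synthetic_hazard:.5f}")
```

With only six toy records the two hazard estimates can differ widely; on a realistic file the comparison would be made across the model's parameters, as the abstract describes.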