Loong Bronwyn, Zaslavsky Alan M, He Yulei, Harrington David P
Research School of Finance, Actuarial Studies and Applied Statistics, The Australian National University, Canberra, ACT 0200, Australia.
Stat Med. 2013 Oct 30;32(24):4139-61. doi: 10.1002/sim.5841. Epub 2013 May 13.
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss the selection of high disclosure risk variables for synthesis, the specification of imputation models, and the assessment of identification disclosure risk. We evaluate data utility by replicating published analyses and comparing the results obtained from the original and synthetic data, and we discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are best suited to preliminary data analysis. These methods address the requirement to share data in clinical research without compromising confidentiality.
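The abstract refers to inferential methods for partially synthetic data without stating them. As a rough illustration only, the sketch below applies the combining rules commonly used for partially synthetic data (usually attributed to Reiter): with m synthetic datasets, the point estimate is the average of the per-dataset estimates, and the variance is the average within-dataset variance plus the between-dataset variance divided by m. That the paper uses exactly these rules is an assumption, and the function name and inputs are hypothetical, not taken from the paper.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets.

    Hypothetical helper: `estimates` holds the point estimate from each
    synthetic dataset, `variances` the matching variance estimates.
    Returns (point estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching inputs")

    q_bar = mean(estimates)    # average point estimate across datasets
    b = variance(estimates)    # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)    # average within-dataset variance

    # Total variance for partially synthetic data: u_bar + b / m
    total_var = u_bar + b / m

    # Degrees of freedom of the reference t distribution; infinite when
    # the estimates do not vary across synthetic datasets.
    df = (m - 1) * (1 + u_bar / (b / m)) ** 2 if b > 0 else float("inf")
    return q_bar, total_var, df


if __name__ == "__main__":
    # Made-up numbers for illustration, not CanCORS results.
    print(combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

Comparing the combined estimate and its interval with the estimate from the original data is one simple way to gauge data utility for a given analysis, in the spirit of the replication exercise the abstract describes.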