Balancing Inferential Integrity and Disclosure Risk via Model Targeted Masking and Multiple Imputation.

Author information

Jiang Bei, Raftery Adrian E, Steele Russell J, Wang Naisyin

Affiliations

Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada.

Department of Statistics, University of Washington, Seattle, WA 98195, USA.

Publication information

J Am Stat Assoc. 2022;117(537):52-66. doi: 10.1080/01621459.2021.1909597. Epub 2021 May 4.

DOI:10.1080/01621459.2021.1909597

PMID:39391212

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11466287/

Abstract

There is a growing expectation that data collected by government-funded studies should be openly available to ensure research reproducibility, and a matching growth in concern about data privacy. One strategy to protect individuals' identities is to release multiply imputed (MI) synthetic datasets in which sensitive values are masked (Rubin, 1993). However, information loss or incorrectly specified imputation models can weaken or invalidate the inferences obtained from the MI datasets. Using a restricted-use Canadian Scleroderma Research Group (CSRG) dataset, the authors investigate a new masking framework with a data-augmentation (DA) component and a tuning mechanism that balances protection against identity disclosure with preservation of data utility. In a work-disability study and an interstitial lung disease study, this DA-MI strategy reached 0% identity-disclosure risk, preserved all inferential conclusions, and produced average confidence interval (CI) overlaps of 98.5% and 95.5%, respectively, relative to the 95% CIs constructed from the genuine CSRG dataset; the lowest CI-overlap value was 91%. The currently used methods did not match this: their average CI-overlap values ranged from 73.9% to 91.8%, and the lowest value was 28.1%. These findings indicate that the DA-MI masking framework facilitates the sharing of useful research data while protecting participants' identities.
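The abstract reports CI overlap without defining it. As a minimal sketch, assuming the commonly used interval-overlap utility measure (the intersection length averaged as a fraction of each interval's length), the computation could look like the following. The function name `ci_overlap`, the choice to clamp disjoint intervals to zero, and the example intervals are illustrative assumptions, not taken from the paper.

```python
def ci_overlap(orig, synth):
    """Average fractional overlap of two intervals given as (lower, upper).

    Assumed definition: the intersection length divided by each interval's
    own length, averaged over the two intervals. Disjoint intervals are
    scored 0.0 here, which is a simplifying assumption.
    """
    lo_o, hi_o = orig
    lo_s, hi_s = synth
    # Length of the intersection; non-positive when the intervals do not overlap.
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    if inter <= 0:
        return 0.0
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


# Hypothetical intervals, not values from the CSRG analysis.
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))  # 0.5
```

Under this definition, a value of 1.0 means the masked-data interval coincides with the original-data interval, so values near 98.5% indicate nearly identical intervals.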
