Wang Hao, Reiter Jerome P
Department of Statistics, University of South Carolina, Columbia, South Carolina 29208, USA.
Ann Appl Stat. 2012 Mar 1;6(1):229-252. doi: 10.1214/11-AOAS506.
When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects' identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach, based on multiple imputation, for protecting the confidentiality of data with geographic identifiers. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model for latitude and longitude conditional on attributes, and simulate new latitude and longitude values from the fitted model. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.
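The simulation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the model details, so the sketch assumes a sequential decomposition (latitude given attributes, then longitude given attributes and latitude), a small hand-written regression tree, and synthetic values drawn directly from the observed values in each leaf. All function names (`build_tree`, `leaf_values`, `synthesize_locations`) and the toy data are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, y, min_leaf=5):
    """Grow a regression tree; each leaf keeps the observed y values that fall in it."""
    n = len(rows)
    best = None
    if n >= 2 * min_leaf:
        for j in range(len(rows[0])):
            # Candidate thresholds: every distinct value except the largest.
            for t in sorted(set(r[j] for r in rows))[:-1]:
                left = [y[i] for i in range(n) if rows[i][j] <= t]
                right = [y[i] for i in range(n) if rows[i][j] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                score = sse(left) + sse(right)
                if best is None or score < best[0]:
                    best = (score, j, t)
    if best is None:
        return {"leaf": list(y)}
    _, j, t = best
    li = [i for i in range(n) if rows[i][j] <= t]
    ri = [i for i in range(n) if rows[i][j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in li], [y[i] for i in li], min_leaf),
        "right": build_tree([rows[i] for i in ri], [y[i] for i in ri], min_leaf),
    }


def leaf_values(tree, row):
    """Return the observed responses stored in the leaf that `row` falls into."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def synthesize_locations(attrs, lat, lon, rng, min_leaf=5):
    """Simulate one set of (latitude, longitude) pairs conditional on attributes.

    Assumption: latitude is modeled given attributes, then longitude given
    attributes and latitude; draws are observed values from the matching leaf.
    """
    lat_tree = build_tree(attrs, lat, min_leaf)
    syn_lat = [rng.choice(leaf_values(lat_tree, a)) for a in attrs]

    lon_rows = [list(a) + [la] for a, la in zip(attrs, lat)]
    lon_tree = build_tree(lon_rows, lon, min_leaf)
    syn_lon = [
        rng.choice(leaf_values(lon_tree, list(a) + [sl]))
        for a, sl in zip(attrs, syn_lat)
    ]
    return list(zip(syn_lat, syn_lon))


# Toy data, made up for illustration only: attributes are (age, binary indicator).
rng = random.Random(0)
attrs = [[rng.randint(20, 90), rng.randint(0, 1)] for _ in range(60)]
lat = [36.0 + 0.001 * a[0] + rng.gauss(0, 0.01) for a in attrs]
lon = [-78.9 + 0.05 * a[1] + rng.gauss(0, 0.01) for a in attrs]

synthetic = synthesize_locations(attrs, lat, lon, rng)
```

Under a multiple imputation scheme, `synthesize_locations` would be called several times to produce several simulated datasets for release. Drawing observed values from leaves reuses real coordinates one margin at a time, so a practical version would likely smooth or perturb the draws and then assess disclosure risk before release.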