Li Xiao-Bai, Sarkar Sumit
Department of Operations and Information Systems, University of Massachusetts Lowell, Lowell, Massachusetts 01854.
Manage Sci. 2013 Apr 1;59(4). doi: 10.1287/mnsc.1120.1584.
The extensive use of information technologies by organizations to collect and share personal data has raised strong privacy concerns. To respond to the public's demand for data privacy, a class of clustering-based data masking techniques is increasingly being used for privacy-preserving data sharing and analytics. Traditional clustering-based approaches for masking numeric attributes, while addressing re-identification risks, typically do not consider the disclosure risk of categorical confidential attributes. We propose a new approach to deal with this problem. The proposed method clusters data such that the data points within a group are similar in the non-confidential attribute values, whereas the confidential attribute values within a group are well distributed. To accomplish this, the clustering method, which is based on a minimum spanning tree (MST) technique, uses two risk-utility tradeoff measures, one in the growing stage and one in the pruning stage of the MST technique. As part of our approach, we also propose a novel cluster-level micro-perturbation method for masking data. It overcomes a common problem of traditional clustering-based masking methods: their inability to preserve important statistical properties such as the variance of attributes and the covariance across attributes. We show that the mean vector and the covariance matrix of the masked data generated using the micro-perturbation method are unbiased estimates of the original mean vector and covariance matrix. An experimental study on several real-world datasets demonstrates the effectiveness of the proposed approach.
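The variance problem mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' micro-perturbation method, whose details the abstract does not give. It is a simplified one-attribute illustration under stated assumptions: cluster labels are supplied directly rather than produced by an MST, and the function names `mask_by_centroid` and `mask_with_cluster_noise` are hypothetical. Replacing each value with its cluster mean removes the within-cluster variance; adding noise at the cluster level, scaled to each cluster's own spread, approximately restores it.

```python
import random
import statistics


def _group(values, labels):
    """Collect the values that belong to each cluster label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


def mask_by_centroid(values, labels):
    """Replace every value with its cluster mean (plain centroid masking)."""
    means = {g: statistics.fmean(vs) for g, vs in _group(values, labels).items()}
    return [means[g] for g in labels]


def mask_with_cluster_noise(values, labels, rng):
    """Replace every value with its cluster mean plus Gaussian noise.

    Assumption for illustration only: the noise standard deviation is the
    cluster's own population standard deviation, so the within-cluster
    spread lost by centroid masking is put back in expectation.
    """
    groups = _group(values, labels)
    means = {g: statistics.fmean(vs) for g, vs in groups.items()}
    sds = {g: statistics.pstdev(vs) for g, vs in groups.items()}
    return [means[g] + rng.gauss(0.0, sds[g]) for g in labels]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic clusters with different centers and spreads.
    values = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    values += [rng.gauss(10.0, 2.0) for _ in range(1000)]
    labels = [0] * 1000 + [1] * 1000

    centroid = mask_by_centroid(values, labels)
    noisy = mask_with_cluster_noise(values, labels, rng)

    print("original variance:", statistics.pvariance(values))
    print("centroid variance:", statistics.pvariance(centroid))
    print("noisy variance:   ", statistics.pvariance(noisy))
```

In this sketch the centroid-masked data keeps the overall mean but understates the variance, while the noise-perturbed data stays close to the original on both. The method described in the abstract goes further and makes this property hold for the full mean vector and covariance matrix.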