Li Xiao-Bai, Sarkar Sumit
Department of Operations and Information Systems, Manning School of Business, University of Massachusetts Lowell, Lowell, MA 01854 U.S.A.
Naveen Jindal School of Management, University of Texas at Dallas, Richardson, TX 75080 U.S.A.
MIS Q. 2014 Sep;38(3):679-698. doi: 10.25300/misq/2014/38.3.03.
Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-analysis and data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a "regression attack," has not been addressed in the data privacy literature, and existing privacy-preserving techniques are not suited to coping with it. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach introduces a novel measure that assesses the risk of sensitive value disclosure in the process of building a regression tree model. Specifically, we develop an algorithm that uses the measure to prune the tree and thereby limit disclosure of sensitive data. We also propose a dynamic value-concatenation method for anonymizing data, which preserves data utility better than the user-defined generalization schemes commonly used in existing approaches. Our approach can be used for anonymizing both numeric and categorical data. An experimental study is conducted using real-world financial, economic, and healthcare data. The results of the experiments demonstrate that the proposed approach is very effective in protecting data privacy while preserving data quality for research and analysis.