Department of Computer Science and Software Engineering, Concordia University, Montreal, Quebec, Canada.
J Am Med Inform Assoc. 2013 May 1;20(3):462-9. doi: 10.1136/amiajnl-2012-001027. Epub 2012 Dec 13.
Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure ε-differential privacy handle relational data and set-valued data separately. In this paper, we propose an algorithm that handles both relational and set-valued data in the differentially private disclosure of healthcare data.
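For reference, the standard definition of ε-differential privacy can be written as follows. The symbols are not taken from the abstract and are introduced here only for illustration: \mathcal{A} is a randomized algorithm, D and D' are any two data sets that differ in a single record, and S is any set of possible outputs.

```latex
\Pr\bigl[\mathcal{A}(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{A}(D') \in S\bigr]
\qquad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{A}).
```

A smaller ε forces the two output distributions closer together, so the presence or absence of any one record has less influence on what is disclosed.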
The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of enumerating all possible records (i.e., a contingency table) and adding noise to each, it generalizes records before adding noise. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee ε-differential privacy.
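The generalize-then-perturb idea can be illustrated with a minimal sketch. The abstract does not say which mechanisms the algorithm uses, so everything concrete here is an assumption: the probabilistic generalization is modeled as an exponential-mechanism choice of a single split point on one numeric attribute, the noise is Laplace noise on group counts, and the privacy budget is split evenly between the two steps. All function names, the `age` attribute, and the toy records are hypothetical. The candidate thresholds and class labels are assumed to be public, that is, fixed independently of the data.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Pick a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def split_score(records, threshold):
    """Utility of a split: sum of the majority-class counts on each side.
    Adding or removing one record changes this by at most 1."""
    score = 0
    for side in (True, False):
        labels = Counter(lab for age, lab in records if (age < threshold) == side)
        score += max(labels.values(), default=0)
    return score


def release(records, thresholds, labels, epsilon, seed=None):
    """Assumed two-step release: noisy choice of generalization, then noisy counts."""
    rng = random.Random(seed)
    eps_split = eps_count = epsilon / 2.0  # assumed even budget split

    # Step 1: choose how to generalize 'age' in a probabilistic way.
    t = exponential_mechanism(
        thresholds, lambda c: split_score(records, c), eps_split, 1.0, rng
    )

    # Step 2: add Laplace noise to the count of every (group, label) cell.
    # Each record falls in exactly one cell, so the sensitivity is 1.
    counts = Counter((age < t, lab) for age, lab in records)
    noisy = {}
    for below in (True, False):
        group = f"age<{t}" if below else f"age>={t}"
        for lab in labels:
            noisy[(group, lab)] = counts[(below, lab)] + laplace_noise(1.0 / eps_count, rng)
    return t, noisy


# Toy usage with made-up records of the form (age, class label).
records = [(23, "Y"), (31, "Y"), (45, "N"), (52, "N"), (67, "N"), (29, "Y")]
threshold, table = release(records, [30, 40, 50, 60], ["Y", "N"], epsilon=1.0, seed=0)
for cell, value in sorted(table.items()):
    print(cell, round(value, 2))
```

Because noise is added to a handful of generalized groups rather than to every cell of a full contingency table, the number of noisy counts stays small. This sketch covers only a single relational attribute and one split, and does not attempt to handle set-valued data.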
We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis.
The utility of the disclosed data may degrade when the output domain is very large, which may make the approach unsuitable for generating synthetic data for large health databases.
Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and set-valued data in a differentially private manner, and can retain essential information for discriminative analysis.