Deconvoluting kernel density estimation and regression for locally differentially private data.

Author information

Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville, VIC, 3010, Australia.

Publication information

Sci Rep. 2020 Dec 7;10(1):21361. doi: 10.1038/s41598-020-78323-0.

DOI:10.1038/s41598-020-78323-0

PMID:33288799

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7721740/

Abstract

Local differential privacy has become the gold standard in the privacy literature for gathering or releasing sensitive individual data points in a privacy-preserving manner. However, locally differentially private data can distort the probability density of the data because of the additive noise used to ensure privacy. In fact, the density of privacy-preserving data (no matter how many samples we gather) is always flatter than the density function of the original data points, because it is the convolution of that density with the density function of the privacy-preserving noise. The effect is especially pronounced when using slowly decaying privacy-preserving noise, such as Laplace noise. This can result in under- or over-estimation of the heavy hitters. This is an important challenge facing social scientists due to the use of differential privacy in the 2020 Census in the United States. In this paper, we develop density estimation methods using smoothing kernels. We use the framework of deconvoluting kernel density estimators to remove the effect of privacy-preserving noise. This approach also allows us to adapt the results from non-parametric regression with errors-in-variables to develop regression models based on locally differentially private data. We demonstrate the performance of the developed methods on financial and demographic datasets.
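The flattening effect and its correction can be illustrated with a minimal sketch. It is not the paper's implementation: the Gaussian base kernel, the fixed bandwidth `h`, the noise scale `b`, the synthetic data, and the function name `laplace_deconv_kde` are all assumptions chosen for illustration. For Laplace noise with scale `b`, the characteristic function is 1/(1 + b²t²), so dividing it out of a Gaussian kernel gives the closed-form deconvoluting kernel K*(u) = φ(u)[1 + (b/h)²(1 − u²)], where φ is the standard normal density.

```python
import numpy as np

def laplace_deconv_kde(z, grid, h, b):
    """Deconvoluting KDE for z = x + Laplace(0, b) noise (hypothetical helper).

    Uses a Gaussian base kernel; b = 0 reduces to the ordinary (naive) KDE.
    """
    z = np.asarray(z, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # u[j, i] = (grid_j - z_i) / h
    u = (grid[:, None] - z[None, :]) / h
    gauss = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Closed-form deconvoluting kernel for Laplace noise: K(u) - (b/h)^2 K''(u)
    kernel = gauss * (1.0 + (b / h) ** 2 * (1.0 - u**2))
    return kernel.mean(axis=1) / h

# Synthetic example (assumed data, not from the paper).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, n)        # original sensitive values
b = 0.5                            # Laplace scale, e.g. sensitivity / epsilon
z = x + rng.laplace(0.0, b, n)     # locally privatized reports

grid = np.linspace(-12.0, 12.0, 1201)
h = 0.4                            # bandwidth picked by hand for the sketch
f_hat = laplace_deconv_kde(z, grid, h, b)    # noise-corrected estimate
naive = laplace_deconv_kde(z, grid, h, 0.0)  # ordinary KDE of the noisy data
```

In this sketch the naive estimate is visibly flatter than the corrected one, which matches the abstract's description. The corrected estimate still integrates to one but can dip slightly below zero in the tails, and its variance grows as `b / h` increases, so the bandwidth would need more careful selection in practice.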
