Chaudhuri Kamalika, Hsu Daniel
University of California, San Diego, 9500 Gilman Drive #0404, La Jolla, CA 92093-0404.
Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142.
JMLR Workshop Conf Proc. 2011;19:155-186.
This work studies the problem of privacy-preserving classification - namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that, even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution chosen independently of the sensitive data. Given such a reference distribution, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between the reference distribution and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement to label privacy - namely, to require that only the labels (and not the unlabeled parts of the examples) be treated as sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.
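For orientation, the following is the standard definition of differential privacy assumed in this line of work (the abstract itself does not restate it): a randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private if, for all pairs of datasets $D$ and $D'$ that differ in a single example and all measurable sets $S$ of outputs,

\[
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{A}(D') \in S].
\]

Label privacy is the relaxation in which $D$ and $D'$ are required to differ only in the label of a single example, so that only the labels (and not the unlabeled parts of the examples) are protected.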