Wang Meng, Ji Zhanglong, Kim Hyeon-Eui, Wang Shuang, Xiong Li, Jiang Xiaoqian
Department of Biomedical Informatics, University of California at San Diego, CA, 92093 U.S.; now with the Department of Genetics, Stanford University, CA, 94305, U.S.
Department of Biomedical Informatics, University of California at San Diego, CA, 92093 U.S.
IEEE Trans Knowl Data Eng. 2018 Mar 1;30(3):573-584. doi: 10.1109/TKDE.2017.2773545. Epub 2017 Nov 14.
Privacy in data sharing, especially for health data, is receiving increasing attention. Some patients now consent to making their information public for research use, which raises a new question: how can public information be used effectively to better understand a private dataset without breaching privacy? In this paper, we cast this question as selecting an optimal subset of the public dataset for M-estimators within the framework of differential privacy (DP) [1]. From the perspective of non-interactive learning, we first construct a weighted private density estimate from the hybrid datasets under DP. Following the approach of [2], we then analyze the accuracy of DP M-estimators based on the hybrid datasets. Our main contributions are: (i) we find that the bias-variance tradeoff in the performance of our M-estimators can be characterized by the sample size of the released dataset; and (ii) based on this finding, we develop an algorithm that selects the optimal subset of the public dataset to release under DP. Simulation studies and applications to real datasets confirm our findings and provide guidelines for practical use.