Local Differential Privacy in the Medical Domain to Protect Sensitive Information: Algorithm Development and Real-World Validation.

Authors

Sung MinDong, Cha Dongchul, Park Yu Rang

Affiliations

Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea.

Department of Otorhinolaryngology, Yonsei University College of Medicine, Seoul, Republic of Korea.

Publication information

JMIR Med Inform. 2021 Nov 8;9(11):e26914. doi: 10.2196/26914.

Abstract

BACKGROUND

Privacy, particularly the privacy of medical data, is of increasing interest in the present big data era. Differential privacy has emerged as the standard method for preserving privacy during data analysis and publishing.

OBJECTIVE

Using machine learning techniques, we applied differential privacy to medical data with diverse parameters and checked the feasibility of our algorithms with synthetic data as well as the balance between data privacy and utility.

METHODS

All data were normalized to a range between -1 and 1, and the bounded Laplacian method was applied to prevent the generation of out-of-bound values after applying the differential privacy algorithm. To preserve the cardinality of the categorical variables, we performed postprocessing via discretization. The algorithm was evaluated using both synthetic and real-world data (from the eICU Collaborative Research Database). We evaluated the difference between the original data and the perturbed data using the misclassification rate for categorical data and the mean squared error for continuous data. Further, we compared the performance of classification models that predict in-hospital mortality using real-world data.
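The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the synthetic variables, and the noise scale (range divided by ε, the textbook Laplace calibration) are all assumptions, and bounding is done here by redrawing out-of-range samples, which is only one of several ways to bound Laplace noise. With this simple calibration, redrawing may weaken the formal privacy guarantee, so the sketch should be read as a demonstration of the data flow only.

```python
import numpy as np

def normalize(x, lo, hi):
    """Linearly rescale values from [lo, hi] to [-1, 1]."""
    return 2.0 * (np.asarray(x, dtype=float) - lo) / (hi - lo) - 1.0

def bounded_laplace(x, epsilon, rng, lower=-1.0, upper=1.0):
    """Add Laplace noise, redrawing any value that falls outside [lower, upper]."""
    x = np.asarray(x, dtype=float)
    scale = (upper - lower) / epsilon  # assumption: scale = range / epsilon
    out = x + rng.laplace(0.0, scale, size=x.shape)
    outside = (out < lower) | (out > upper)
    while outside.any():
        out[outside] = x[outside] + rng.laplace(0.0, scale, size=int(outside.sum()))
        outside = (out < lower) | (out > upper)
    return out

def discretize(y, levels):
    """Return, for each perturbed value, the index of the nearest category level."""
    levels = np.asarray(levels, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.abs(y[:, None] - levels[None, :]).argmin(axis=1)

def misclassification_rate(original_codes, perturbed_codes):
    """Fraction of records whose category changed after perturbation."""
    return float(np.mean(np.asarray(original_codes) != np.asarray(perturbed_codes)))

def mean_squared_error(original, perturbed):
    """Mean squared difference between original and perturbed values."""
    diff = np.asarray(original, dtype=float) - np.asarray(perturbed, dtype=float)
    return float(np.mean(diff ** 2))

# Hypothetical synthetic data: one 4-class categorical and one continuous variable.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=2000)
levels = normalize(np.arange(4), 0, 3)            # class positions in [-1, 1]
values = normalize(rng.uniform(40.0, 180.0, size=2000), 40.0, 180.0)

for eps in (0.1, 10.0, 1e4):
    noisy_codes = discretize(bounded_laplace(levels[codes], eps, rng), levels)
    noisy_values = bounded_laplace(values, eps, rng)
    print(eps,
          misclassification_rate(codes, noisy_codes),
          mean_squared_error(values, noisy_values))
```

Snapping each perturbed value back to the nearest class position keeps the number of distinct categories unchanged, which is the point of the discretization step.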

RESULTS

The misclassification rate of categorical variables ranged between 0.49 and 0.85 when the value of ε was 0.1, and it converged to 0 as ε increased. When ε was between 10 and 10, the misclassification rate rapidly dropped to 0. Similarly, the mean squared error of the continuous variables decreased as ε increased. The performance of the model developed from perturbed data converged to that of the model developed from original data as ε increased. In particular, the accuracy of a random forest model developed from the original data was 0.801, and this value ranged from 0.757 to 0.81 when ε was 10 and 10, respectively.
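The direction of these trends follows from how Laplace noise is usually calibrated. The block below is a worked illustration under assumptions that the abstract does not state: the standard definition of ε-local differential privacy for a mechanism M, a sensitivity Δ = 2 (the width of the normalized range [-1, 1]), and a noise scale b = Δ/ε for the unbounded Laplace distribution.

```latex
\[
\Pr[M(x) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(x') \in S]
\quad \text{for all inputs } x, x' \text{ and all output sets } S
\]
\[
b = \frac{\Delta}{\varepsilon} = \frac{2}{\varepsilon},
\qquad
\mathrm{Var}\bigl[\mathrm{Lap}(b)\bigr] = 2b^{2} = \frac{8}{\varepsilon^{2}}
\]
\[
\varepsilon = 0.1:\; b = 20,\; 2b^{2} = 800
\qquad
\varepsilon = 10:\; b = 0.2,\; 2b^{2} = 0.08
\qquad
\varepsilon = 1000:\; b = 0.002,\; 2b^{2} = 8 \times 10^{-6}
\]
```

Under these assumptions, the noise at ε = 0.1 is far wider than the data range of width 2, so a perturbed category is close to a random draw, whereas at large ε the noise is negligible relative to the spacing between categories. Because the bounded mechanism keeps outputs inside [-1, 1], the squared error of any single value cannot exceed 4, so the unbounded variance above is an upper-level intuition rather than the observed mean squared error.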

CONCLUSIONS

We applied local differential privacy to medical domain data, which are diverse and high dimensional. Higher noise may offer enhanced privacy, but it simultaneously hinders utility. We should choose an appropriate degree of noise for data perturbation to balance privacy and utility depending on specific situations.
