Spengler Helmut, Prasser Fabian
Technical University of Munich, School of Medicine, Institute of Medical Informatics, Statistics and Epidemiology, Munich, Germany.
Stud Health Technol Inform. 2019 Sep 3;267:207-214. doi: 10.3233/SHTI190829.
Modern medical research requires access to patient-level data of significant detail and volume. In this context, privacy concerns and legal requirements demand careful consideration. Data anonymization, i.e. the transformation of data to reduce privacy risks, is an important building block of data protection concepts. However, common methods of data anonymization often fail to protect data against inference of sensitive attribute values (also called attribute disclosure). Measures against such attacks have been developed, but it has been argued that they are of little practical relevance, as they involve significant data transformations which reduce output data utility to an unacceptable degree. In this article, we present an experimental study of the degree of protection and the impact on data utility provided by different approaches for protecting biomedical data from attribute disclosure. We quantified the utility and privacy risks of datasets that have been protected using different anonymization methods and parameterizations. We compared the results with trivial baseline approaches, visualized them in the form of risk-utility curves and analyzed basic statistical properties of the sensitive attributes (e.g. the skewness of their distribution). Our results confirm that it is difficult to protect data from attribute disclosure, but they also indicate that reasonable degrees of protection can be achieved when appropriate methods are chosen based on data characteristics. While it is hard to give general recommendations, the approach presented in this article and the tools that we have used can be helpful for deciding how a given dataset can best be protected in a specific usage scenario.
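To make the evaluation idea concrete, the following is a minimal, illustrative sketch (not the authors' code or parameterization) of how one might derive points for a risk-utility curve: a toy dataset is generalized at increasing levels, a simple attribute-disclosure risk proxy and a crude utility proxy are computed per level, and the skewness of the sensitive attribute's value distribution is inspected. The column names, binning scheme and both metrics are assumptions made for illustration only.

```python
# Illustrative sketch only: toy risk-utility points for attribute disclosure.
# Dataset, generalization scheme and metrics are assumptions, not the paper's.
import pandas as pd
from scipy.stats import skew

# Toy dataset: 'age' and 'zip' act as quasi-identifiers, 'diagnosis' is the
# sensitive attribute whose values an attacker might try to infer.
df = pd.DataFrame({
    "age":       [23, 25, 31, 34, 45, 47, 52, 58, 61, 64],
    "zip":       ["80331", "80333", "80335", "80336", "81541",
                  "81543", "81547", "81549", "81667", "81669"],
    "diagnosis": ["flu", "flu", "cancer", "flu", "flu",
                  "hiv", "flu", "cancer", "flu", "flu"],
})

# Basic statistical property of the sensitive attribute, analogous to the
# skewness analysis mentioned in the abstract (here on value frequencies).
freqs = df["diagnosis"].value_counts(normalize=True)
print("sensitive value frequencies:\n", freqs)
print("skewness of the frequency distribution:", skew(freqs.values))

def generalize(data: pd.DataFrame, level: int) -> pd.DataFrame:
    """Coarsen the quasi-identifiers; higher level = stronger generalization."""
    out = data.copy()
    if level >= 1:                      # age -> 10-year bands
        out["age"] = (out["age"] // 10) * 10
    if level >= 2:                      # zip -> 3-digit prefix
        out["zip"] = out["zip"].str[:3]
    if level >= 3:                      # suppress zip entirely
        out["zip"] = "*"
    return out

def attribute_disclosure_risk(data: pd.DataFrame) -> float:
    """Worst-case probability of guessing the sensitive value within a class."""
    groups = data.groupby(["age", "zip"])["diagnosis"]
    return groups.apply(lambda s: s.value_counts(normalize=True).max()).max()

def utility(level: int, max_level: int = 3) -> float:
    """Crude utility proxy: share of generalization 'headroom' not used."""
    return 1.0 - level / max_level

# One (risk, utility) point per generalization level -> a risk-utility curve.
for level in range(4):
    g = generalize(df, level)
    print(f"level={level}  risk={attribute_disclosure_risk(g):.2f}  "
          f"utility={utility(level):.2f}")
```

In practice, a dedicated anonymization tool such as ARX would be used to apply methods like l-diversity or t-closeness and to compute established risk and utility models; the sketch above only mirrors the general workflow of trading protection against data usefulness.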