The Importance of Context: Risk-based De-identification of Biomedical Data.

Author information

Prasser Fabian, Kohlmayer Florian, Kuhn Klaus A

Affiliations

Dr. Fabian Prasser, Institute of Medical Statistics and Epidemiology, University Hospital rechts der Isar, Technical University of Munich, Grillparzerstr. 18, 81675 Munich, Germany

Publication information

Methods Inf Med. 2016 Aug 5;55(4):347-55. doi: 10.3414/ME16-01-0012. Epub 2016 Jun 20.

DOI:10.3414/ME16-01-0012

PMID:27322502

Abstract

BACKGROUND

Data sharing is a central aspect of modern biomedical research. It raises significant privacy concerns, and data often needs to be protected from re-identification. De-identification methods transform datasets in such a way that it becomes extremely difficult to link their records to identified individuals. The most important challenge in this process is to find an adequate balance between the increase in privacy and the decrease in data quality.

OBJECTIVES

Accurately measuring the risk of re-identification in a specific data sharing scenario is an important aspect of data de-identification. Overestimating risks significantly degrades data quality, while underestimating them leaves data prone to attacks on privacy. Several models have been proposed for measuring risks, but there is a lack of generic methods for risk-based data de-identification. The aim of the work described in this article was to bridge this gap and to show how the quality of de-identified datasets can be improved by using risk models to tailor the process of de-identification to a concrete context.
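To make the notion of a re-identification risk concrete, the following minimal sketch computes record-level risks under the prosecutor model, a commonly used attacker model that the abstract itself does not name. It assumes the attacker knows that the target is in the dataset, so the risk of a record is one divided by the number of records sharing its quasi-identifier values. The function names, attribute names, and sample records are illustrative assumptions and are not taken from ARX.

```python
from collections import Counter


def prosecutor_risks(records, quasi_identifiers):
    """Return one risk value per record: 1 / size of its equivalence class."""
    # Records with identical quasi-identifier values form one equivalence class.
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[key] for key in keys]


def summarize(risks):
    """Aggregate record-level risks into highest and average risk."""
    return {"highest": max(risks), "average": sum(risks) / len(risks)}


# Hypothetical sample data; "diagnosis" is treated as sensitive, not as a quasi-identifier.
records = [
    {"age": "30-39", "zip": "816**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "816**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "816**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "803**", "diagnosis": "flu"},
]

risks = prosecutor_risks(records, ["age", "zip"])
summary = summarize(risks)
print(risks, summary)
```

In this sample, the two records that share an equivalence class each have a risk of 0.5, while the two unique records have a risk of 1.0. Models that assume a weaker attacker typically yield lower estimates for the same data, which is why the choice of model affects how strongly a dataset must be transformed.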

METHODS

We implemented a generic de-identification process and several models for measuring re-identification risks in the ARX de-identification tool for biomedical data. By integrating the methods into an existing framework, we were able to automatically transform datasets in such a way that information loss is minimized while ensuring that re-identification risks meet a user-defined threshold. We performed an extensive experimental evaluation to analyze how the choice of risk model and the assumptions about an attacker's goals and background knowledge affect the quality of de-identified data.
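The idea of minimizing information loss subject to a risk threshold can be illustrated with a small, self-contained sketch. It is not the ARX implementation, whose search strategy and quality models are not described in this abstract. The sketch assumes hypothetical generalization hierarchies for two attributes, uses the highest prosecutor risk as the risk measure, and uses the normalized sum of generalization levels as a crude stand-in for information loss. It then exhaustively tries every combination of levels and keeps the least lossy one that meets the threshold.

```python
from collections import Counter
from itertools import product


# Hypothetical hierarchies: level 0 keeps the value, higher levels coarsen it.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    if level == 0:
        return zip_code
    if level == 1:
        return zip_code[:3] + "**"
    return "*"


# Attribute name -> (generalization function, maximum level).
HIERARCHIES = {"age": (generalize_age, 2), "zip": (generalize_zip, 2)}


def highest_risk(rows):
    """Highest prosecutor risk: 1 / size of the smallest equivalence class."""
    return 1.0 / min(Counter(rows).values())


def find_transformation(records, threshold):
    """Return (levels, loss) with minimal loss and risk <= threshold, or None."""
    attrs = list(HIERARCHIES)
    max_total = sum(HIERARCHIES[a][1] for a in attrs)
    best = None
    for levels in product(*(range(HIERARCHIES[a][1] + 1) for a in attrs)):
        rows = [
            tuple(HIERARCHIES[a][0](record[a], lvl) for a, lvl in zip(attrs, levels))
            for record in records
        ]
        if highest_risk(rows) > threshold:
            continue  # this transformation does not meet the risk threshold
        loss = sum(levels) / max_total  # crude proxy for information loss
        if best is None or loss < best[1]:
            best = (dict(zip(attrs, levels)), loss)
    return best


# Hypothetical sample data.
records = [
    {"age": 34, "zip": "81675"},
    {"age": 37, "zip": "81675"},
    {"age": 36, "zip": "81667"},
    {"age": 45, "zip": "81667"},
    {"age": 41, "zip": "81675"},
    {"age": 48, "zip": "81669"},
]

best = find_transformation(records, threshold=0.5)
stricter = find_transformation(records, threshold=0.2)
print(best, stricter)
```

With a threshold of 0.5, the sketch selects decade-level ages and three-digit ZIP prefixes. Tightening the threshold to 0.2 forces ages to be suppressed entirely, which shows how a stricter risk assumption costs more data quality. Exhaustive enumeration only works for tiny search spaces like this one.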

RESULTS

The results of our experiments show that data quality can be improved significantly by using risk models for data de-identification. On a scale where 100% represents the original input dataset and 0% represents a dataset from which all information has been removed, the loss of information content could be reduced by up to 10% when protecting datasets against strong adversaries and by up to 24% when protecting datasets against weaker adversaries.

CONCLUSIONS

The methods studied in this article are well suited for protecting sensitive biomedical data, and our implementation is available as open-source software. Data custodians can use our results to increase the information content of de-identified data by tailoring the process to a specific data sharing scenario. Improving data quality is important for fostering the adoption of de-identification methods in biomedical research.
