A Scalable and Pragmatic Method for the Safe Sharing of High-Quality Health Data.

Publication information

IEEE J Biomed Health Inform. 2018 Mar;22(2):611-622. doi: 10.1109/JBHI.2017.2676880. Epub 2017 Mar 23.

Abstract

The sharing of sensitive personal health data is an important aspect of biomedical research. Methods of data de-identification are often used in this process to trade the granularity of data off against privacy risks. However, traditional approaches, such as HIPAA safe harbor or k-anonymization, often fail to provide data with sufficient quality. Alternatively, data can be de-identified only to a degree which still allows us to use it as required, e.g., to carry out specific analyses. Controlled environments, which restrict the ways recipients can interact with the data, can then be used to cope with residual risks. The contributions of this article are twofold. First, we present a method for implementing controlled data sharing environments and analyze its privacy properties. Second, we present a de-identification method which is specifically suited for sanitizing health data which is to be shared in such environments. Traditional de-identification methods control the uniqueness of records in a dataset. The basic idea of our approach is to reduce the probability that a record in a dataset has characteristics which are unique within the underlying population. As the characteristics of the population are typically not known, we have implemented a pragmatic solution in which properties of the population are modeled with statistical methods. We have further developed an accompanying process for evaluating and validating the degree of protection provided. The results of an extensive experimental evaluation show that our approach enables the safe sharing of high-quality data and that it is highly scalable.
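
The difference between uniqueness within a dataset and uniqueness within the underlying population can be illustrated with a minimal Python sketch. The abstract does not say which statistical models the authors use, so this example substitutes a deliberately simple one as an assumption: population cell counts follow a Poisson distribution, the dataset is a Bernoulli sample with a known sampling fraction, and the quasi-identifying attributes are independent. The function names, the `sampling_fraction` parameter, and the record layout are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import exp


def sample_uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs
    exactly once in the dataset (the quantity traditional methods control)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def estimated_population_uniqueness(records, quasi_identifiers, sampling_fraction):
    """Expected fraction of records that are unique in the population.

    Illustrative assumptions (not taken from the paper):
      * population cell counts are Poisson distributed with mean lam,
      * the dataset is a Bernoulli sample with the given sampling fraction,
      * attributes are independent, so lam is estimated from marginals.
    Under these assumptions, a sample-unique record is population-unique
    with probability exp(-(1 - sampling_fraction) * lam).
    """
    n = len(records)
    population_size = n / sampling_fraction
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    # Marginal value counts for each quasi-identifier.
    marginals = [Counter(r[q] for r in records) for q in quasi_identifiers]

    total = 0.0
    for key in keys:
        if counts[key] != 1:
            # A record shared within the sample cannot be population-unique.
            continue
        cell_probability = 1.0
        for value, marginal in zip(key, marginals):
            cell_probability *= marginal[value] / n
        lam = population_size * cell_probability
        total += exp(-(1.0 - sampling_fraction) * lam)
    return total / n
```

When the dataset covers the whole population (`sampling_fraction=1.0`), the two measures coincide. For a small sample of a large population the estimated population uniqueness is typically much lower than the sample uniqueness, which is why a population-based criterion can leave more detail in the data than one that controls uniqueness within the dataset alone.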
