Poulis Giorgos, Loukides Grigorios, Skiadopoulos Spiros, Gkoulalas-Divanis Aris
Department of Informatics and Telecommunications, University of the Peloponnese, Greece.
Department of Informatics, King's College London, UK.
J Biomed Inform. 2017 Jan;65:76-96. doi: 10.1016/j.jbi.2016.11.001. Epub 2016 Nov 8.
Publishing patient data that contain both demographics and diagnosis codes is essential for performing large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss incurred by anonymization, to preserve utility in general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are unsuitable for this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce requirement (i) by applying (k,k^m)-anonymity, a privacy principle that prevents re-identification by attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture requirement (ii), we propose the concept of utility constraint for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii) by employing well-established information loss measures for demographics and for diagnosis codes. To realize our approach, we develop an algorithm that enforces (k,k^m)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and incurs minimal information loss according to these measures. Our experiments with a large dataset containing more than 200,000 electronic health records show the effectiveness and efficiency of our algorithm.
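The privacy requirement described above can be illustrated with a small check. The following Python sketch is a simplified reading of the abstract, not the paper's formal definition or its anonymization algorithm: it assumes that demographics have already been generalized into identical tuples within each group, and it tests whether every combination of up to m diagnosis codes that appears in a record is shared by at least k records with the same demographics. The function name `is_k_km_anonymous`, the record layout, and the sample values are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def is_k_km_anonymous(records, k, m):
    """Simplified check (an assumption, not the paper's formal definition).

    records: list of (demographics, codes) pairs, where demographics is a
    hashable tuple of already-generalized values and codes is a set of
    diagnosis codes.
    """
    # Group records by their (generalized) demographic values.
    groups = defaultdict(list)
    for demographics, codes in records:
        groups[demographics].append(frozenset(codes))

    for code_sets in groups.values():
        # An attacker who knows only the demographics must still face
        # at least k candidate records.
        if len(code_sets) < k:
            return False
        # Count how many records in the group contain each combination
        # of 1..m codes that actually occurs in some record.
        support = Counter()
        for codes in code_sets:
            for size in range(1, m + 1):
                for combo in combinations(sorted(codes), size):
                    support[combo] += 1
        # Every such combination must be shared by at least k records.
        if any(count < k for count in support.values()):
            return False
    return True


# Hypothetical toy data: three records with the same generalized demographics.
example = [
    (("30-39", "F"), {"250", "401"}),
    (("30-39", "F"), {"250"}),
    (("30-39", "F"), {"401"}),
]

print(is_k_km_anonymous(example, k=2, m=1))
print(is_k_km_anonymous(example, k=2, m=2))
```

In this toy dataset each single code appears in two records, so the check passes for k = 2 and m = 1. The pair of codes appears in only one record, so the check fails for m = 2, which shows how a larger m models a more knowledgeable attacker.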