Bradley Malin
Department of Biomedical Informatics, Eskind Biomedical Library, Fourth Floor, 2209 Garland Avenue, Vanderbilt University, Nashville, TN 37232-8340, USA.
Artif Intell Med. 2007 Jul;40(3):223-39. doi: 10.1016/j.artmed.2007.04.002. Epub 2007 Jun 1.
Health care organizations must preserve a patient's anonymity when disclosing personal data. Traditionally, patient identity has been protected by stripping identifiers from sensitive data such as DNA. However, simple automated methods can re-identify patient data using public information. In this paper, we present a solution that prevents a threat to patient anonymity arising when multiple health care organizations disclose data. In this setting, a patient's location visit pattern, or "trail", can link seemingly anonymous DNA back to the patient's identity. This threat exists because health care organizations (1) cannot prevent the disclosure of certain types of patient information and (2) do not know how to systematically avoid trail re-identification. We develop and evaluate computational methods that health care organizations can apply to disclose patient-specific DNA records that are impregnable to trail re-identification.
To prevent trail re-identification, we introduce a formal model called k-unlinkability, which enables health care administrators to specify different degrees of patient anonymity. Specifically, k-unlinkability is satisfied when the trail of each DNA record is linkable to no fewer than k identified records. We present several algorithms that enable health care organizations to coordinate their data disclosure, so that they can determine which DNA records can be shared without violating k-unlinkability. We evaluate the algorithms with the trails of patient populations derived from publicly available hospital discharge databases. Algorithm efficacy is evaluated using metrics based on real-world applications, including the number of suppressed records and the number of organizations that disclose records.
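The k-unlinkability condition can be illustrated with a minimal sketch. The abstract does not define "linkable" precisely, so the code below assumes one plausible reading: a DNA trail is linkable to an identified trail when every hospital in the DNA trail also appears in the identified trail. The record names, hospital labels, and function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List

# A trail is modeled as the set of hospitals at which a record appears.
Trail = FrozenSet[str]


def linkable(dna_trail: Trail, identified_trail: Trail) -> bool:
    """ASSUMPTION: a DNA trail links to an identified trail when it is a subset of it."""
    return dna_trail <= identified_trail


def link_counts(dna: Dict[str, Trail], identified: Dict[str, Trail]) -> Dict[str, int]:
    """For each DNA record, count the identified records its trail is linkable to."""
    return {
        rid: sum(linkable(trail, other) for other in identified.values())
        for rid, trail in dna.items()
    }


def violating_records(dna: Dict[str, Trail], identified: Dict[str, Trail], k: int) -> List[str]:
    """DNA records whose trail links to fewer than k identified records."""
    counts = link_counts(dna, identified)
    return sorted(rid for rid, count in counts.items() if count < k)


def is_k_unlinkable(dna: Dict[str, Trail], identified: Dict[str, Trail], k: int) -> bool:
    """True when every DNA trail is linkable to no fewer than k identified records."""
    return not violating_records(dna, identified, k)


# Hypothetical toy data: four identified patients, three DNA records.
identified_trails = {
    "alice": frozenset({"H1", "H2"}),
    "bob": frozenset({"H1", "H2"}),
    "carol": frozenset({"H1"}),
    "dave": frozenset({"H3"}),
}
dna_trails = {
    "d1": frozenset({"H1", "H2"}),
    "d2": frozenset({"H1"}),
    "d3": frozenset({"H3"}),
}

counts = link_counts(dna_trails, identified_trails)
violations = violating_records(dna_trails, identified_trails, k=2)
```

In this toy data, record d3 links to a single identified record, so the disclosure is 1-unlinkable but not 2-unlinkable. Raising k tightens the anonymity requirement, which is the knob the abstract describes administrators using to specify the degree of protection.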
Our experiments indicate that it is unnecessary to suppress all patient records that initially violate k-unlinkability. Rather, only portions of the trails need to be suppressed. For example, if each hospital discloses 100% of its data on patients diagnosed with cystic fibrosis, then 48% of the DNA records are 5-unlinkable. A naïve solution would suppress the 52% of the DNA records that violate 5-unlinkability. However, by applying our protection algorithms, the hospitals can disclose 95% of the DNA records, all of which are 5-unlinkable. Similar findings hold for all populations studied.
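The difference between suppressing whole records and suppressing portions of trails can be sketched as follows. This is not the paper's algorithm, which the abstract does not specify; it is a simple greedy heuristic written for illustration, under the assumption that a DNA trail is linkable to an identified trail when it is a subset of it. All names and data are hypothetical.

```python
from typing import Dict, FrozenSet

# A trail is modeled as the set of hospitals at which a record appears.
Trail = FrozenSet[str]


def link_count(trail: Trail, identified: Dict[str, Trail]) -> int:
    """ASSUMPTION: a trail links to every identified trail that contains it."""
    return sum(trail <= other for other in identified.values())


def protect(dna: Dict[str, Trail], identified: Dict[str, Trail], k: int) -> Dict[str, Trail]:
    """Greedy illustration (not the authors' method): withhold one hospital at a
    time from a violating trail until it links to at least k identified records.
    An empty result means the record is suppressed everywhere."""
    released = {}
    for rid, trail in dna.items():
        current = set(trail)
        while current and link_count(frozenset(current), identified) < k:
            # Withhold the hospital whose removal yields the most links;
            # ties are broken alphabetically because max keeps the first maximum.
            best = max(
                sorted(current),
                key=lambda loc: link_count(frozenset(current - {loc}), identified),
            )
            current.remove(best)
        released[rid] = frozenset(current)
    return released


# Hypothetical toy data.
identified_trails = {
    "p1": frozenset({"H1", "H2"}),
    "p2": frozenset({"H1", "H2"}),
    "p3": frozenset({"H1", "H3"}),
    "p4": frozenset({"H1"}),
}
dna_trails = {
    "d1": frozenset({"H1", "H2"}),  # links to p1 and p2
    "d2": frozenset({"H1", "H3"}),  # links only to p3, so it violates k=2
}

released = protect(dna_trails, identified_trails, k=2)
fully_suppressed = [rid for rid, trail in released.items() if not trail]
```

With k=2, a naive policy would withhold record d2 entirely. In this sketch only hospital H3 withholds it, so d2 is still disclosed by H1 with a trail that links to all four identified records. The same pattern, where partial suppression rescues records that whole-record suppression would discard, is what the reported 48% versus 95% disclosure rates reflect, although the toy numbers here carry no empirical weight.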
This research demonstrates that patient anonymity can be formally protected in shared databases. Our findings illustrate that significant quantities of patient-specific data can be disclosed with provable protection from trail re-identification. The configurability of our methods allows health care administrators to quantify the effects of different levels of privacy protection and formulate policy accordingly.