Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, Tennessee 37203, USA.
J Am Med Inform Assoc. 2010 May-Jun;17(3):322-7. doi: 10.1136/jamia.2009.002725.
De-identified clinical data in standardized form (eg, diagnosis codes), derived from electronic medical records, are increasingly combined with research data (eg, DNA sequences) and disseminated to enable scientific investigations. This study examines whether released data can be linked with identified clinical records that are accessible via various resources to jeopardize patients' anonymity, and the ability of popular privacy protection methodologies to prevent such an attack.
The study experimentally evaluates the re-identification risk of a de-identified sample of Vanderbilt's patient records involved in a genome-wide association study. It also measures the level of protection from re-identification, and data utility, provided by suppression and generalization.
Privacy protection is quantified using the probability of re-identifying a patient in a larger population through diagnosis codes. Data utility is measured at a dataset level, using the percentage of retained information, as well as its description, and at a patient level, using two metrics based on the difference between the distribution of International Classification of Diseases (ICD) version 9 codes before and after applying privacy protection.
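The re-identification probability described above can be illustrated with a minimal sketch. It assumes an attacker who knows all of a patient's released diagnosis codes and matches them against every population record that contains those codes; the function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import FrozenSet, List

Record = FrozenSet[str]  # one patient's set of ICD-9 diagnosis codes


def matching_records(record: Record, population: List[Record]) -> int:
    """Count population records that contain every code in `record`."""
    return sum(1 for other in population if record <= other)


def reidentification_probability(record: Record, population: List[Record]) -> float:
    """Probability of picking the right patient among all matching records."""
    matches = matching_records(record, population)
    return 1.0 / matches if matches else 0.0


def fraction_unique(sample: List[Record], population: List[Record]) -> float:
    """Share of sample records matched by exactly one population record."""
    unique = sum(1 for r in sample if matching_records(r, population) == 1)
    return unique / len(sample)


# Toy population (illustrative codes only, not study data).
population = [
    frozenset({"250.00", "401.9"}),
    frozenset({"250.00", "401.9", "272.4"}),
    frozenset({"493.90"}),
    frozenset({"401.9"}),
]
# The released sample is a subset of the population.
sample = [population[1], population[2], population[3]]

unique_share = fraction_unique(sample, population)
risk_common_code = reidentification_probability(population[3], population)
```

In this toy data, two of the three sample records are matched by a single population record, while the record holding only `401.9` is shared by three patients and so carries a lower risk.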
More than 96% of 2800 patients' records are shown to be uniquely identified by their diagnosis codes with respect to a population of 1.2 million patients. Generalization is shown to reduce the percentage of re-identifiable records by less than 2%, and over 99% of the three-digit ICD-9 codes need to be suppressed to prevent re-identification.
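To make the two protection methods concrete, the following sketch applies generalization (truncating each ICD-9 code to the category before the decimal point) and suppression (dropping selected codes) to a toy dataset. The helper names, the choice of suppressed codes, and the code-count definition of retained information are assumptions made for illustration, not the study's implementation.

```python
from typing import FrozenSet, List, Set

Record = FrozenSet[str]


def generalize_code(code: str) -> str:
    """Map an ICD-9 code to its category, e.g. '250.01' -> '250'."""
    return code.split(".", 1)[0]


def generalize(records: List[Record]) -> List[Record]:
    return [frozenset(generalize_code(c) for c in r) for r in records]


def suppress(records: List[Record], removed: Set[str]) -> List[Record]:
    """Remove every occurrence of the codes in `removed`."""
    return [r - removed for r in records]


def count_unique(records: List[Record]) -> int:
    """Records whose code set is contained in no other record of the list."""
    return sum(
        1 for r in records if sum(1 for other in records if r <= other) == 1
    )


def retained_fraction(before: List[Record], after: List[Record]) -> float:
    """Assumed utility measure: share of code occurrences that survive."""
    total = sum(len(r) for r in before)
    return sum(len(r) for r in after) / total if total else 1.0


# Toy data (illustrative codes only, not study data).
records = [
    frozenset({"250.00", "401.9"}),
    frozenset({"250.01", "401.9"}),
    frozenset({"493.90"}),
]

generalized = generalize(records)
unique_before = count_unique(records)
unique_after = count_unique(generalized)

suppressed = suppress(generalized, {"250", "493"})
kept = retained_fraction(generalized, suppressed)
```

Here generalization merges the first two records into the same code set, while suppressing two of the three categories leaves only 40% of the code occurrences, which mirrors the trade-off between protection and utility that the results report.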
Popular privacy protection methods are inadequate to deliver a sufficiently protected and useful result when sharing data derived from complex clinical systems. The development of alternative privacy protection models is thus required.