Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37232-8340, USA.
J Am Med Inform Assoc. 2013 Jan 1;20(1):95-101. doi: 10.1136/amiajnl-2012-001026. Epub 2012 Jul 21.
To lower patient re-identification risks for biomedical research databases containing laboratory test results while minimizing changes in clinical data interpretation.
In our threat model, an attacker obtains 5-7 laboratory results from one patient and uses them as a search key to discover the corresponding record in a de-identified biomedical research database. To test our models, the existing Vanderbilt TIME database of 8.5 million Safe Harbor de-identified laboratory results from 61 280 patients was used. The uniqueness of unaltered laboratory results in the dataset was examined, and then two data perturbation models were applied: simple random offsets and an expert-derived clinical meaning-preserving model. A rank-based re-identification algorithm was used to mimic an attack. The re-identification risk and the retention of clinical meaning were assessed for each model's perturbed laboratory results.
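The threat model and the simple random-offset perturbation can be illustrated with a minimal Python sketch. The abstract does not specify the offset distribution, the distance measure, or the ranking rule, so everything here is an assumption made for illustration: the function names `perturb_random_offset`, `rank_candidates`, and `reidentification_rate`, the `max_fraction` and `top_k` parameters, the uniform relative offset, and the summed relative-difference score. For simplicity the attacker's search key is the patient's full original record rather than a sample of 5-7 results.

```python
import random


def perturb_random_offset(values, max_fraction=0.05, rng=None):
    """Shift each value by a uniform random offset of at most
    +/- max_fraction of its magnitude (assumed perturbation model)."""
    rng = rng or random.Random()
    return [v * (1 + rng.uniform(-max_fraction, max_fraction)) for v in values]


def rank_candidates(search_key, database):
    """Return patient ids ordered from best to worst match.

    search_key: dict mapping test name -> value known to the attacker
    database:   dict mapping patient id -> dict of test name -> value
    The score is a summed relative difference (an assumption, not the
    published ranking rule); records missing a test rank last.
    """
    def distance(record):
        total = 0.0
        for test, known in search_key.items():
            if test not in record:
                return float("inf")
            scale = abs(known) or 1.0  # avoid dividing by zero
            total += abs(record[test] - known) / scale
        return total

    return sorted(database, key=lambda pid: distance(database[pid]))


def reidentification_rate(original, perturbed, top_k=1):
    """Fraction of patients whose true record appears among the top_k
    candidates when their original results are used as the search key."""
    hits = 0
    for pid, record in original.items():
        if pid in rank_candidates(record, perturbed)[:top_k]:
            hits += 1
    return hits / len(original)
```

In this sketch, raising `max_fraction` makes the perturbed records drift further from the attacker's key, which would be expected to lower the rate returned by `reidentification_rate` at the cost of larger changes to the data.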
Differences in re-identification rates between the algorithms were small despite substantial divergence in altered clinical meaning. The expert algorithm maintained the clinical meaning of laboratory results better (affecting up to 4% of test results) than simple perturbation (affecting up to 26%).
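One way to quantify how many test results are "affected" is to count the results whose interpretation changes after perturbation. The sketch below assumes that clinical meaning is the classification of a result as low, normal, or high against a reference range; the abstract does not state how clinical meaning was actually defined, and the names `interpret` and `fraction_meaning_changed` are hypothetical.

```python
def interpret(value, low, high):
    """Classify a result against a reference range [low, high]
    (assumed proxy for clinical meaning)."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def fraction_meaning_changed(original, perturbed, low, high):
    """Fraction of results whose low/normal/high class differs between
    the original and perturbed values, compared pairwise."""
    if len(original) != len(perturbed):
        raise ValueError("original and perturbed must have the same length")
    if not original:
        return 0.0
    changed = sum(
        interpret(o, low, high) != interpret(p, low, high)
        for o, p in zip(original, perturbed)
    )
    return changed / len(original)
```

A perturbation scheme that refuses to move a value across a reference-range boundary would score 0 on this measure, which is the kind of constraint a meaning-preserving model might impose.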
With growing impetus for sharing clinical data for research, and in view of healthcare-related federal privacy regulation, methods to mitigate risks of re-identification are important. A practical, expert-derived perturbation algorithm that demonstrated potential utility was developed. Similar approaches might enable administrators to select data protection scheme parameters that meet their preferences in the trade-off between the protection of privacy and the retention of clinical meaning of shared data.