Benitez Kathleen, Loukides Grigorios, Malin Bradley
IHI. 2010;2010:163-172. doi: 10.1145/1882992.1883017.
Regulations in various countries permit the reuse of health information without patient authorization provided the data is "de-identified". In the United States, for instance, the Privacy Rule of the Health Insurance Portability and Accountability Act defines two distinct approaches to achieve de-identification: the first is Safe Harbor, which requires the removal of a list of identifiers, and the second is Expert Determination, which requires that an expert certify that the re-identification risk inherent in the data is sufficiently low. In reality, most healthcare organizations eschew the expert route because there are no standardized approaches and Safe Harbor is much simpler to interpret. This, however, precludes a wide range of worthwhile endeavors that depend on features suppressed by Safe Harbor, such as gerontological studies requiring detailed ages over 89. In response, we propose a novel approach to automatically discover alternative de-identification policies that contain no more re-identification risk than Safe Harbor. We model this task as a lattice-search problem, introduce a measure to capture the re-identification risk, and develop an algorithm that efficiently discovers policies by exploring the lattice. Using a cohort of approximately 3000 patient records from the Vanderbilt University Medical Center, as well as the Adult dataset from the UCI Machine Learning Repository, we also experimentally verify that a large number of alternative policies can be discovered in an efficient manner.
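The idea of searching a lattice of policies for those no riskier than Safe Harbor can be illustrated with a minimal sketch. It is not the algorithm or the risk measure from the paper: the records, the generalization levels (`AGE_WIDTHS`, `ZIP_DIGITS`), the simplified Safe Harbor baseline, and the toy risk measure (the number of records left unique after generalization) are all assumptions made for illustration, and the lattice is enumerated exhaustively rather than explored efficiently.

```python
from collections import Counter
from itertools import product

# Hypothetical generalization levels, ordered from most to least specific.
AGE_WIDTHS = [1, 5, 10]   # age is reported in bins of this many years
ZIP_DIGITS = [5, 3, 0]    # number of leading ZIP code digits retained

# Made-up (age, ZIP code) records, used only to exercise the sketch.
RECORDS = [
    (34, "37203"), (36, "37203"), (91, "37212"),
    (93, "37215"), (52, "37064"), (58, "37027"),
]


def count_uniques(generalized):
    """Toy risk measure: how many generalized records are unique."""
    counts = Counter(generalized)
    return sum(1 for c in counts.values() if c == 1)


def safe_harbor_risk(records):
    """Risk under a simplified Safe Harbor stand-in (an assumption):
    ages above 89 collapse into one 90+ group, ZIP keeps 3 digits."""
    return count_uniques((min(age, 90), z[:3]) for age, z in records)


def policy_risk(records, age_width, zip_digits):
    """Risk under one node of the lattice (one pair of levels)."""
    return count_uniques(
        (age // age_width, z[:zip_digits]) for age, z in records
    )


def find_policies(records, baseline):
    """Enumerate every node of the small lattice and keep the
    policies whose risk does not exceed the baseline."""
    return [
        (a, z)
        for a, z in product(AGE_WIDTHS, ZIP_DIGITS)
        if policy_risk(records, a, z) <= baseline
    ]


baseline = safe_harbor_risk(RECORDS)
alternatives = find_policies(RECORDS, baseline)
print(baseline, alternatives)
```

On this toy data, the policy `(5, 3)` reports ages in 5-year bins, including ages over 89, while leaving no more unique records than the Safe Harbor stand-in. That is the kind of alternative the abstract describes. Because these generalization levels are nested, coarsening can only merge groups, so this toy risk never increases when moving up the lattice. A more efficient search could use that property to prune, which the brute-force enumeration here does not attempt.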