Durham Elizabeth Ashley, Kantarcioglu Murat, Xue Yuan, Toth Csaba, Kuzu Mehmet, Malin Bradley
Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232.
Department of Computer Science, University of Texas at Dallas, Richardson, TX, 75083.
IEEE Trans Knowl Data Eng. 2014 Dec;26(12):2956-2968. doi: 10.1109/TKDE.2013.91.
The process of record linkage seeks to integrate instances that correspond to the same entity. Record linkage has traditionally been performed through the comparison of identifying field values. However, when databases are maintained by disparate organizations, the disclosure of such information can breach the privacy of the corresponding individuals. Various private record linkage (PRL) methods have been developed to obscure such identifiers, but they vary widely in their ability to balance the competing goals of accuracy, efficiency, and security. The tokenization and hashing of field values into Bloom filters (BF) enables greater linkage accuracy and efficiency than other PRL methods, but the encodings may be compromised through frequency-based cryptanalysis. Our objective is to adapt a BF encoding technique to mitigate such attacks with minimal sacrifices in accuracy and efficiency. To accomplish these goals, we introduce a statistically informed method to generate BF encodings that integrate bits from multiple fields, the frequencies of which are provably associated with a minimum number of fields. Our method enables a user-specified tradeoff between security and accuracy. We compare our encoding method with other techniques using a public dataset of voter registration records and demonstrate that the increases in security come with only minor losses to accuracy.
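The encoding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bigram tokenizer, the SHA-1/MD5 double-hashing scheme, the filter length `m`, the number of hash functions `k`, and the per-field bit counts in `plan` are all assumptions chosen for illustration, and the statistically informed selection of how many bits each field contributes is not reproduced here. The sketch only shows the general shape: build one Bloom filter per field, sample bits from each, and permute the combined bits with a seed shared by the linking parties.

```python
import hashlib
import random


def bigrams(value):
    """Split a padded, lower-cased string into overlapping 2-character tokens."""
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def field_bloom_filter(value, m=64, k=4):
    """Hash each bigram k times into an m-bit list (double hashing; assumed scheme)."""
    bits = [0] * m
    for token in bigrams(value):
        h1 = int(hashlib.sha1(token.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(k):
            bits[(h1 + i * h2) % m] = 1
    return bits


def composite_bloom_filter(record, plan, seed=0):
    """Sample bits from each field's filter, then shuffle them with a shared seed.

    plan maps field name -> (field filter length, number of bits to sample).
    The bit counts are hypothetical; the paper derives them statistically.
    """
    rng = random.Random(seed)  # same seed => same sampled positions and permutation
    composite = []
    for field in sorted(plan):
        m, n_bits = plan[field]
        bf = field_bloom_filter(record[field], m=m)
        positions = [rng.randrange(m) for _ in range(n_bits)]  # with replacement
        composite.extend(bf[p] for p in positions)
    rng.shuffle(composite)
    return composite


def dice(a, b):
    """Dice coefficient between two equal-length bit lists."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


if __name__ == "__main__":
    plan = {"first": (64, 20), "last": (64, 12)}  # hypothetical bit allocation
    r1 = composite_bloom_filter({"first": "ann", "last": "smith"}, plan, seed=7)
    r2 = composite_bloom_filter({"first": "ann", "last": "smyth"}, plan, seed=7)
    print(len(r1), dice(r1, r2))
```

Because every record is encoded with the same `plan` and `seed`, a given position in the composite filter always originates from the same field bit, so two encodings can be compared with the Dice coefficient without revealing which field each bit came from.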