Durham Elizabeth, Xue Yuan, Kantarcioglu Murat, Malin Bradley
Department of Biomedical Informatics, Vanderbilt University, 2525 West End Avenue, Nashville, TN 37203, USA.
Inf Fusion. 2012 Oct 1;13(4):245-259. doi: 10.1016/j.inffus.2011.04.004.
Record linkage is the task of identifying records from disparate data sources that refer to the same entity. It is an integral component of data processing in distributed settings, where the integration of information from multiple sources can prevent duplication and enrich overall data quality, thus enabling more detailed and correct analysis. Privacy-preserving record linkage (PPRL) is a variant of the task in which data owners wish to perform linkage without revealing identifiers associated with the records. This task is desirable in various domains, including healthcare, where it may not be possible to reveal patient identity due to confidentiality requirements, and in business, where it could be disadvantageous to divulge customers' identities. To perform PPRL, it is necessary to apply string comparators that function in the privacy-preserving space. A number of privacy-preserving string comparators (PPSCs) have been proposed, but little research has compared them in the context of a real record linkage application. This paper performs a principled and comprehensive evaluation of six PPSCs in terms of three key properties: 1) correctness of record linkage predictions, 2) computational complexity, and 3) security. We utilize a real, publicly available dataset, derived from the North Carolina voter registration database, to evaluate the tradeoffs between the aforementioned properties. Among our results, we find that PPSCs that partition, encode, and compare strings yield highly accurate record linkage results. However, as a tradeoff, we observe that such PPSCs are less secure than those that map and compare strings in a reduced dimensional space.
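As an illustration of what "partition, encode, and compare" can look like in practice, the sketch below splits each string into padded bigrams, hashes every bigram into a fixed-size bit array with a keyed hash (a Bloom-filter-style encoding), and compares two encodings with the Dice coefficient. The abstract does not name the six comparators it evaluates, so this particular technique, the function names (`bigrams`, `encode`, `dice`), and the parameters (`num_bits`, `num_hashes`, the shared `key`) are all assumptions chosen for illustration, not the paper's method or settings.

```python
import hashlib
import hmac


def bigrams(s):
    """Partition: split a string into overlapping 2-grams, padded at both ends."""
    padded = f"_{s.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(s, key, num_bits=1000, num_hashes=20):
    """Encode: map each bigram to bit positions using keyed hashes.

    Returns the set of positions that would be set to 1 in a bit array
    of length num_bits. The parameter values are illustrative assumptions.
    """
    positions = set()
    for gram in bigrams(s):
        for i in range(num_hashes):
            message = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def dice(a, b):
    """Compare: Dice coefficient over the set bit positions of two encodings."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Both data owners are assumed to share this secret key (hypothetical value).
key = b"shared-secret"
same = dice(encode("smith", key), encode("smith", key))
similar = dice(encode("smith", key), encode("smyth", key))
different = dice(encode("smith", key), encode("jones", key))
```

Only the encodings would be exchanged between data owners, so similar names still score higher than unrelated ones without the plaintext identifiers being revealed. Because the encoding keeps bigram-level structure, it also illustrates why comparators of this kind may leak more information than approaches that compare strings in a reduced dimensional space.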