Yigzaw Kassaye Yitbarek, Michalas Antonis, Bellika Johan Gustav
Department of Computer Science, UiT The Arctic University of Norway, 9037, Tromsø, Norway.
Norwegian Centre for E-health Research, University Hospital of North Norway, 9019, Tromsø, Norway.
BMC Med Inform Decis Mak. 2017 Jan 3;17(1):1. doi: 10.1186/s12911-016-0389-x.
Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step.
We designed a secure protocol for the deduplication of horizontally partitioned datasets using deterministic record linkage algorithms. We provide a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on datasets in which the number of records held by each laboratory varied. Experiments were also performed on simulated microbiology datasets, with data custodians connected through a local area network.
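To illustrate the core idea of deterministic record linkage for deduplication, the sketch below deduplicates records across custodians by comparing keyed hashes of exact-match identifier fields, so that equal identifiers can be detected without exchanging the identifiers in the clear. This is a minimal illustration of the underlying principle only, not the authors' protocol; the field name `national_id` and the shared HMAC key are hypothetical assumptions.

```python
import hmac
import hashlib

# Hypothetical exact-match linkage fields; the real protocol's
# identifiers and cryptographic construction differ.
LINKAGE_FIELDS = ("national_id",)

def pseudonymize(record, key):
    """Keyed hash (HMAC-SHA256) over the linkage fields.

    Deterministic linkage: identical identifiers always produce
    identical digests, so duplicates can be matched on digests
    without revealing the identifiers themselves."""
    linkage_key = "|".join(record[f] for f in LINKAGE_FIELDS)
    return hmac.new(key, linkage_key.encode(), hashlib.sha256).hexdigest()

def deduplicate(custodian_datasets, key):
    """Keep the first occurrence of each pseudonym across all custodians."""
    seen = set()
    kept = []
    for dataset in custodian_datasets:
        for record in dataset:
            digest = pseudonymize(record, key)
            if digest not in seen:
                seen.add(digest)
                kept.append(record)
    return kept
```

In this toy setting, a single shared key and a central comparison step stand in for the paper's distributed, collusion-resistant design; the sketch only conveys why deterministic linkage makes cross-custodian duplicate detection possible.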
The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under the semi-honest adversarial model. More precisely, the protocol remains secure even if up to N - 2 corrupt data custodians collude. The total runtime of the protocol scales linearly with the number of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem.
The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.