Cybernetica AS, Ülikooli 2, Tartu, 51003, Estonia.
STACC, Ülikooli 2, Tartu, 51003, Estonia.
BMC Med Genomics. 2018 Oct 11;11(Suppl 4):84. doi: 10.1186/s12920-018-0400-8.
Practical applications of data analysis may require combining multiple databases that belong to different owners, such as health centers. The analysis should be performed without violating the privacy of either the centers themselves or the patients whose records these centers store. To avoid biased analysis results, it may be important to remove duplicate records among the centers, so that each patient's data is taken into account only once. This task is closely related to privacy-preserving record linkage.
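To make the task concrete, the following is a minimal plaintext sketch of the deduplication functionality only: every record key that has already appeared at an earlier center (or earlier at the same center) is flagged as a duplicate. It is not privacy-preserving and is not the protocol of this paper; the function name `deduplicate`, the center names, and the use of plain string keys are assumptions made for illustration. A secure solution would compute comparable keep/drop flags without any party seeing the other centers' keys.

```python
from typing import Dict, List


def deduplicate(centers: Dict[str, List[str]]) -> Dict[str, List[bool]]:
    """Return, per center, one flag per record: True = keep, False = duplicate.

    Plaintext reference only (hypothetical helper): all keys are visible to
    whoever runs this, which is exactly what a secure protocol must avoid.
    Exact matching of keys is assumed.
    """
    seen = set()  # keys already encountered at any center
    keep: Dict[str, List[bool]] = {}
    for name, records in centers.items():
        flags = []
        for key in records:
            flags.append(key not in seen)  # first occurrence is kept
            seen.add(key)
        keep[name] = flags
    return keep


# Example: patient "p2" is stored at both centers, "p3" twice at center B.
example = {"A": ["p1", "p2"], "B": ["p2", "p3", "p3"]}
flags = deduplicate(example)
```

In this sketch the first center to hold a record keeps it; any other tie-breaking rule would serve equally well as long as each patient is counted once.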
This paper presents a solution to privacy-preserving deduplication among the records of several databases using secure multiparty computation. It is built upon Sharemind, one of the fastest practical secure multiparty computation platforms.
Tests on simulated databases totaling ca. 10 million records (1000 health centers with 10000 records each) show that the computation is feasible in practice. The expected running time of the experiment is ca. 30 min for computing servers connected over a 100 Mbit/s WAN. The expected error of the results is 2, and no errors were detected on the particular test set that we used for our benchmarks.
The solution is ready for practical use. It has well-defined security properties, implied by the properties of the Sharemind platform. The solution assumes that exact matching of records is required; a possible direction for future research is extending it to approximate matching.