An Improved Chinese String Comparator for Bloom Filter Based Privacy-Preserving Record Linkage.

Authors

Sun Siqi, Qian Yining, Zhang Ruoshi, Wang Yanqi, Li Xinran

Affiliation

Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan 430070, China.

Publication information

Entropy (Basel). 2021 Aug 22;23(8):1091. doi: 10.3390/e23081091.

Abstract

With the development of information technology, sharing data from multiple sources without disclosing private information has become a popular topic. Privacy-preserving record linkage (PPRL) links records that truly match without disclosing personal information. Existing PPRL techniques have mostly been studied for alphabetic languages, which differ substantially from the Chinese language environment. In this paper, Chinese characters (identification fields in record pairs) are encoded into strings of letters and numbers using the SoundShape code, according to their shapes and pronunciations. The SoundShape codes are then encrypted with a Bloom filter, and the similarity of the encrypted fields is calculated with Dice similarity. The method takes into account the false positive rate of the Bloom filter and different proportions of sound code and shape code. Finally, we applied the method to synthetic datasets and compared precision, recall, F1-score and computational time across different values of false positive rate and proportion. The results showed that our method for PPRL in the Chinese language environment improved the quality of the classification results and outperformed other methods at a relatively low additional computational cost.
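The Bloom filter and Dice similarity steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the filter length `m`, the number of hash functions `k`, the bigram tokenization, the double-hashing scheme and the example code strings are all assumptions chosen for illustration, and the input strings are hypothetical placeholders rather than real SoundShape codes. The helper `false_positive_rate` uses the standard approximation (1 - e^(-kn/m))^k for a filter of m bits holding n elements with k hash functions.

```python
import hashlib
import math


def qgrams(s, q=2):
    """Split a string into its set of overlapping q-grams (bigrams by default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def bloom_encode(s, m=1000, k=10, q=2):
    """Return the set of bit positions that string s sets in an m-bit Bloom filter.

    Assumption: double hashing (h1 + i * h2) mod m, a common construction in
    PPRL work; the abstract does not state which hash scheme the paper uses.
    """
    bits = set()
    for gram in qgrams(s, q):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits


def dice(a, b):
    """Dice similarity of two bit-position sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def false_positive_rate(m, k, n):
    """Approximate Bloom filter false positive rate for n inserted elements."""
    return (1 - math.exp(-k * n / m)) ** k


# Hypothetical placeholder strings standing in for SoundShape codes.
code_a = "ZH1A2345B6"
code_b = "ZH1A2345B7"  # differs from code_a in the last symbol only
code_c = "QW9X8765C0"  # shares no bigram with code_a

sim_same = dice(bloom_encode(code_a), bloom_encode(code_a))
sim_close = dice(bloom_encode(code_a), bloom_encode(code_b))
sim_far = dice(bloom_encode(code_a), bloom_encode(code_c))
fpr = false_positive_rate(m=1000, k=10, n=len(qgrams(code_a)))

print(sim_same, sim_close, sim_far, fpr)
```

In this sketch, a record pair would be classified as a match when the Dice similarity of its encoded fields exceeds a chosen threshold. Lowering `m` or raising `k` increases the false positive rate, which makes unrelated codes look more similar; that trade-off is the kind of parameter the abstract says the paper varies.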

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7238/8394278/bb8bba920244/entropy-23-01091-g005.jpg
