Database Group, University of Leipzig, Leipzig, Germany.
Federated Information Systems, German Cancer Research Center, Heidelberg, Germany.
J Transl Med. 2021 Jan 15;19(1):33. doi: 10.1186/s12967-020-02678-1.
Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases.
We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets that differ in size and in the error rate of matching records. We compare plaintext record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage.
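To illustrate what comparing records encoded as Bloom filters involves, the following is a minimal sketch and not the Mainzelliste implementation. It assumes character bigrams, a keyed double-hashing scheme, and the Dice coefficient as similarity measure; the function names, the filter size, the number of hash functions, and the secret key are all hypothetical choices made for this example.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=512, num_hashes=4, secret=b"demo-secret"):
    """Encode one field as a Bloom filter, returned as the set of set bit positions.

    Assumption: positions come from double hashing (h1 + i * h2) mod size,
    with h1 and h2 taken from a single keyed HMAC-SHA256 digest.
    """
    positions = set()
    for gram in bigrams(value):
        digest = hmac.new(secret, gram.encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return frozenset(positions)


def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    for left, right in [("Meier", "Meier"), ("Meier", "Meyer"), ("Meier", "Schmidt")]:
        score = dice(bloom_encode(left), bloom_encode(right))
        print(f"{left} vs {right}: {score:.2f}")
```

Because similar strings share most of their bigrams, their encodings share most of their set bits, so a spelling variant scores well above an unrelated name even though neither plaintext value is exchanged.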
Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variations in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality.
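The runtime gain from blocking comes from comparing only records that share a blocking key instead of all pairs. The sketch below shows one common LSH variant for bit vectors (Hamming LSH, which samples bit positions); whether the schemes added to Mainzelliste work this way is an assumption, and all names and parameters here are hypothetical. Bloom filters are represented as sets of set bit positions.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_tables(size, num_tables, key_length, seed=42):
    """For each hash table, draw a fixed random sample of bit positions."""
    rng = random.Random(seed)
    return [rng.sample(range(size), key_length) for _ in range(num_tables)]


def lsh_keys(bits, tables):
    """Build one blocking key per table from the bit values at the sampled positions."""
    return [
        (table_id, tuple(pos in bits for pos in positions))
        for table_id, positions in enumerate(tables)
    ]


def candidate_pairs(records, tables):
    """Return the record pairs that share a blocking key in at least one table."""
    blocks = defaultdict(set)
    for record_id, bits in records.items():
        for key in lsh_keys(bits, tables):
            blocks[key].add(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    records = {
        "a": frozenset({1, 5, 9}),
        "b": frozenset({1, 5, 9}),     # same encoding as "a"
        "c": frozenset(range(64)),     # very different encoding
    }
    tables = make_tables(size=64, num_tables=4, key_length=8)
    print(candidate_pairs(records, tables))
```

Only pairs inside a shared block go on to the expensive similarity comparison. More tables raise the chance that true matches with errors still meet in some block, while longer keys make blocks smaller; both parameters trade linkage quality against runtime.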
We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The blocking methods yield order-of-magnitude runtime improvements, thus facilitating the use of Mainzelliste in research projects with large datasets and many participants.