Clinical Research Facility, National University of Ireland, Galway, Ireland.
Graduate Entry Medical School, University of Limerick, Ireland.
Int J Med Inform. 2018 Jan;109:70-75. doi: 10.1016/j.ijmedinf.2017.10.021. Epub 2017 Nov 6.
Record linkage algorithms aim to identify pairs of records that correspond to the same individual across two or more datasets. In general, fields that are common to the datasets are compared to determine which record pairs to link. The classic model for probabilistic linkage was proposed by Fellegi and Sunter. It assumes that the individual fields common to both datasets are completely observed, and that the field agreement indicators are conditionally independent within each of two subsets of record pairs: those corresponding to the same individual and those corresponding to different individuals. Herein, we propose a novel record linkage algorithm that does not rely on these two baseline assumptions. We demonstrate improved performance of the algorithm in the presence of missing data and correlation patterns between the agreement indicators. The algorithm is computationally efficient and can be used to link large databases consisting of millions of record pairs. An R package, corlink, has been developed to implement the new algorithm and can be downloaded from the CRAN repository.
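For orientation, the classic Fellegi and Sunter scoring that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the conditional independence model only, not the proposed algorithm and not the corlink interface. The function name `fs_match_weight`, the m- and u-probabilities, and the rule of skipping missing comparisons are all assumptions made for the example.

```python
import math


def fs_match_weight(agreement, m, u):
    """Log2 likelihood ratio for one record pair under conditional independence.

    agreement: per-field indicators, 1 (agree), 0 (disagree) or None (missing)
    m[i]: P(field i agrees | records belong to the same individual)
    u[i]: P(field i agrees | records belong to different individuals)
    """
    weight = 0.0
    for a, m_i, u_i in zip(agreement, m, u):
        if a is None:
            # Assumption: a missing comparison contributes nothing to the score.
            # This is a common ad hoc workaround, not the paper's treatment.
            continue
        if a == 1:
            weight += math.log2(m_i / u_i)
        else:
            weight += math.log2((1 - m_i) / (1 - u_i))
    return weight


# Hypothetical probabilities for three fields (e.g. surname, birth date, postcode).
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]

full_agree = fs_match_weight([1, 1, 1], m, u)
one_missing = fs_match_weight([1, None, 1], m, u)
all_disagree = fs_match_weight([0, 0, 0], m, u)
```

Because the per-field terms are simply summed, any correlation between agreement indicators is ignored, and a missing field has to be handled by a convention such as the one above. These are the two assumptions the abstract says the new algorithm does not rely on.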