Xu Huiping, Li Xiaochun, Grannis Shaun
Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA.
Regenstrief Institute Inc., Indianapolis, IN, USA.
J Appl Stat. 2021 May 4;49(11):2789-2804. doi: 10.1080/02664763.2021.1922615. eCollection 2022.
The widely used Fellegi-Sunter model for probabilistic record linkage does not use the information contained in field values, so it assigns the same match status whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of the error probabilities associated with field values and the frequencies of agreed field values among matches, often derived from prior studies or training data. When such information is unavailable, these methods are difficult to apply. In this paper, we propose a simple two-step procedure for frequency-based matching within the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on the frequency distributions of the agreed field values among matches and non-matches, which are estimated by the Fellegi-Sunter model without relying on prior studies or training data. In a real-world application and in simulation, our method performs comparably to or better than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when the matching fields discriminate poorly, and the benefit diminishes as the discriminating power of the matching fields increases.
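The idea of a frequency-adjusted weight can be illustrated with a minimal sketch. The abstract does not state the exact adjustment formula, so the form below is an assumption: the standard Fellegi-Sunter agreement weight log2(m/u) plus the log ratio of how often the agreed value appears among agreeing matches versus agreeing non-matches. The function names, the toy surname frequencies, and the values of `m` and `u` are all hypothetical. In the paper these distributions are estimated from the fitted Fellegi-Sunter model, whereas here they are simply supplied by hand.

```python
import math


def base_weight(m, u):
    """Standard Fellegi-Sunter agreement weight for one field: log2(m / u)."""
    return math.log2(m / u)


def adjusted_weight(m, u, p_match, p_nonmatch, value):
    """Agreement weight for a specific agreed value (assumed form, see lead-in).

    p_match[value]    : share of `value` among agreeing matched pairs
    p_nonmatch[value] : share of `value` among agreeing non-matched pairs
    """
    return base_weight(m, u) + math.log2(p_match[value] / p_nonmatch[value])


# Hypothetical toy inputs for a surname field.
m, u = 0.9, 0.1                      # P(agree | match), P(agree | non-match)
freq = {"Smith": 0.8, "Zbornak": 0.2}  # made-up population frequencies

# Assumption: among matches, agreed values follow the population frequencies.
p_match = dict(freq)

# Assumption: non-matches agree on a value by chance, in proportion to f^2.
chance = {v: f * f for v, f in freq.items()}
total = sum(chance.values())
p_nonmatch = {v: c / total for v, c in chance.items()}

w_base = base_weight(m, u)
w_common = adjusted_weight(m, u, p_match, p_nonmatch, "Smith")
w_rare = adjusted_weight(m, u, p_match, p_nonmatch, "Zbornak")

print(f"unadjusted: {w_base:.3f}")
print(f"common value: {w_common:.3f}")
print(f"rare value: {w_rare:.3f}")
```

Under these toy assumptions, agreement on the rare surname receives a larger weight than the unadjusted weight, and agreement on the common surname receives a smaller one, which is the qualitative behavior the abstract describes.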