A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage.

Authors

Xu Huiping, Li Xiaochun, Grannis Shaun

Affiliations

Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA.

Regenstrief Institute Inc., Indianapolis, IN, USA.

Publication

J Appl Stat. 2021 May 4;49(11):2789-2804. doi: 10.1080/02664763.2021.1922615. eCollection 2022.

Abstract

The widely used Fellegi-Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi-Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields with diminished benefit as the discriminating power of matching fields increases.
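
The intuition that agreement on a rare value carries more evidence than agreement on a common one can be illustrated with a minimal Python sketch. It is not the authors' two-step procedure, which estimates the frequency distributions among matches and non-matches from the Fellegi-Sunter model itself. Instead it uses a textbook-style simplification with several assumptions: the function name `frequency_weights`, the agreement probability among matches `m = 0.95`, and the toy surname list are all hypothetical, and field values are assumed to follow the same distribution among matches as in the file overall.

```python
import math
from collections import Counter


def frequency_weights(values, m=0.95):
    """Return (unadjusted_weight, {value: adjusted_weight}) in log2 units.

    m is an ASSUMED probability that a true match agrees on the field;
    in practice it would be estimated, not fixed by hand.
    """
    counts = Counter(values)
    n = len(values)
    freq = {v: c / n for v, c in counts.items()}

    # Chance that two unrelated records agree on the field, whatever the value.
    u = sum(p * p for p in freq.values())
    unadjusted = math.log2(m / u)

    # Agreement on a specific value v (simplifying assumptions):
    #   P(agree on v | match)     ~ m * p_v
    #   P(agree on v | non-match) ~ p_v ** 2
    # so the value-specific weight reduces to log2(m / p_v).
    adjusted = {v: math.log2(m / p) for v, p in freq.items()}
    return unadjusted, adjusted


# Hypothetical toy data: one common surname and two rare ones.
surnames = ["Smith"] * 8 + ["Jones"] + ["Xu"]
base, by_value = frequency_weights(surnames)
for name, weight in sorted(by_value.items(), key=lambda kv: kv[1]):
    print(f"{name}: adjusted={weight:.3f}, unadjusted={base:.3f}")
```

Under these assumptions, agreement on the rare surname receives a larger weight than the single unadjusted weight, while agreement on the common surname receives a smaller one.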
