A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage.

Author information

Xu Huiping, Li Xiaochun, Grannis Shaun

Affiliations

Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA.

Regenstrief Institute Inc., Indianapolis, IN, USA.

Publication information

J Appl Stat. 2021 May 4;49(11):2789-2804. doi: 10.1080/02664763.2021.1922615. eCollection 2022.

DOI:10.1080/02664763.2021.1922615

PMID:35909667

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9336505/

Abstract

The widely used Fellegi-Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi-Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields with diminished benefit as the discriminating power of matching fields increases.
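
The intuition that agreement on a rare value is stronger evidence than agreement on a common one can be illustrated with a minimal sketch. This is not the authors' two-step procedure, which estimates the frequency distributions among matches and non-matches from the fitted Fellegi-Sunter model. The sketch instead assumes a single field, a known agreement probability `m` among matches, errors that do not depend on the value, and the same value distribution in both files. The function names (`value_frequencies`, `unadjusted_weight`, `frequency_weights`) and the toy surname data are hypothetical.

```python
import math
from collections import Counter


def value_frequencies(values):
    """Relative frequency of each distinct field value."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def unadjusted_weight(m, freqs):
    """Value-independent agreement weight log2(m / u).

    u is approximated as the chance that two unrelated records agree,
    i.e. the sum of squared value frequencies (an assumption of this sketch).
    """
    u = sum(p * p for p in freqs.values())
    return math.log2(m / u)


def frequency_weights(m, freqs):
    """Value-specific agreement weights.

    Assumes P(agree on v | match) = m * p_v and
    P(agree on v | non-match) = p_v ** 2, so each weight equals log2(m / p_v).
    """
    return {v: math.log2((m * p) / (p * p)) for v, p in freqs.items()}


# Toy surname field: one common, one moderately common, one rare value.
surnames = ["Smith"] * 6 + ["Jones"] * 3 + ["Zabriskie"]
m = 0.95  # assumed agreement probability among true matches

freqs = value_frequencies(surnames)
flat = unadjusted_weight(m, freqs)
by_value = frequency_weights(m, freqs)

# Print weights from the most common value to the rarest.
for name, w in sorted(by_value.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} weight={w:.3f} (unadjusted={flat:.3f})")
```

Under these assumptions the rarest value receives the largest agreement weight and the most common value receives a weight below the unadjusted one, which is the behaviour the abstract describes for frequency-based matching.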





