A new computationally efficient algorithm for record linkage with field dependency and missing data imputation.

Author information

Clinical Research Facility, National University of Ireland, Galway, Ireland.

Graduate Entry Medical School, University of Limerick, Ireland.

Publication information

Int J Med Inform. 2018 Jan;109:70-75. doi: 10.1016/j.ijmedinf.2017.10.021. Epub 2017 Nov 6.

Abstract

Record linkage algorithms aim to identify pairs of records that correspond to the same individual from two or more datasets. In general, fields that are common to both datasets are compared to determine which record-pairs to link. The classic model for probabilistic linkage was proposed by Fellegi and Sunter and assumes that individual fields common to both datasets are completely observed, and that the field agreement indicators are conditionally independent within the subsets of record pairs corresponding to the same and differing individuals. Herein, we propose a novel record linkage algorithm that is independent of these two baseline assumptions. We demonstrate improved performance of the algorithm in the presence of missing data and correlation patterns between the agreement indicators. The algorithm is computationally efficient and can be used to link large databases consisting of millions of record pairs. An R-package, corlink, has been developed to implement the new algorithm and can be downloaded from the CRAN repository.
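For orientation, the classic Fellegi and Sunter scoring that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the conditional-independence model only, not the algorithm proposed in the paper and not the corlink API. The field names, the m- and u-probabilities, and the rule of skipping a field when a value is missing are all assumptions made for the example.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the paper):
# m = P(field agrees | both records refer to the same individual)
# u = P(field agrees | records refer to different individuals)
M_PROBS = {"surname": 0.95, "birth_year": 0.98, "postcode": 0.90}
U_PROBS = {"surname": 0.01, "birth_year": 0.02, "postcode": 0.05}


def match_weight(agreement):
    """Return the summed log2 likelihood ratio for one record pair.

    `agreement` maps a field name to True (values agree), False (values
    disagree), or None (a value is missing). Summing per-field terms is
    valid only under the conditional-independence assumption. Skipping
    missing fields is a simple ad hoc convention assumed here; it is not
    the imputation approach described in the abstract.
    """
    total = 0.0
    for field, agrees in agreement.items():
        if agrees is None:
            continue  # missing comparison contributes no evidence
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


full = match_weight({"surname": True, "birth_year": True, "postcode": True})
partial = match_weight({"surname": True, "birth_year": True, "postcode": None})
mismatch = match_weight({"surname": False, "birth_year": True, "postcode": True})
print(full, partial, mismatch)
```

Pairs with a weight above an upper threshold would be linked and those below a lower threshold rejected. When agreement indicators are correlated (for example, postcode and surname within a household), summing the terms in this way double-counts evidence, which is the kind of limitation the abstract says the new algorithm is designed to avoid.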
