Improving record linkage performance in the presence of missing linkage data.

Author information

Ong Toan C, Mannino Michael V, Schilling Lisa M, Kahn Michael G

Affiliations

University of Colorado, Denver, Business School, Denver, CO, USA; Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.

University of Colorado, Denver, Business School, Denver, CO, USA.

Publication information

J Biomed Inform. 2014 Dec;52:43-54. doi: 10.1016/j.jbi.2014.01.016. Epub 2014 Feb 10.

Abstract

INTRODUCTION

Existing record linkage methods do not handle missing linking field values in an efficient and effective manner. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when record linkage fields have missing values.

METHODS

By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to solve the missing data problem in record linkage: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of the missing attribute across the remaining available linkage fields in proportion to their relative weights. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value. Linkage Expansion adds fields previously treated as non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.
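The first two methods can be illustrated with a minimal Python sketch. This is not the FRIL implementation or the authors' code: the function names, the exact-match similarity, the simple weighted-sum score, and the imputed distance of 0.5 are all assumptions chosen only to make the idea concrete.

```python
from typing import Dict, Optional, Set

Record = Dict[str, Optional[str]]


def similarity(a: str, b: str) -> float:
    """Toy exact-match similarity in [0, 1] (assumption; real systems
    typically use edit distance or similar string metrics)."""
    return 1.0 if a == b else 0.0


def redistribute_weights(weights: Dict[str, float], missing: Set[str]) -> Dict[str, float]:
    """Drop missing fields and spread their weight over the remaining
    fields in proportion to each remaining field's original weight."""
    available = {f: w for f, w in weights.items() if f not in missing}
    total_available = sum(available.values())
    if total_available == 0:
        return {}
    scale = sum(weights.values()) / total_available
    return {f: w * scale for f, w in available.items()}


def score_weight_redistribution(a: Record, b: Record, weights: Dict[str, float]) -> float:
    """Weighted-sum score computed only over fields present in both records."""
    missing = {f for f in weights if a.get(f) is None or b.get(f) is None}
    new_weights = redistribute_weights(weights, missing)
    return sum(w * similarity(a[f], b[f]) for f, w in new_weights.items())


def score_distance_imputation(a: Record, b: Record, weights: Dict[str, float],
                              imputed_distance: float = 0.5) -> float:
    """Weighted-sum score where a missing field contributes an imputed
    distance (hypothetical default 0.5) instead of an imputed value."""
    score = 0.0
    for field, w in weights.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            sim = 1.0 - imputed_distance
        else:
            sim = similarity(va, vb)
        score += w * sim
    return score


# Hypothetical field weights and records for illustration only.
weights = {"first_name": 30.0, "last_name": 30.0, "dob": 40.0}
rec_a = {"first_name": "ann", "last_name": "lee", "dob": None}
rec_b = {"first_name": "ann", "last_name": "lee", "dob": "1980-01-01"}

wr_score = score_weight_redistribution(rec_a, rec_b, weights)
di_score = score_distance_imputation(rec_a, rec_b, weights)
```

In this sketch, Weight Redistribution lets the two agreeing name fields carry the full score, while Distance Imputation treats the missing date of birth as partial evidence, so the same record pair receives a lower score under the second method.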

RESULTS

The methods developed had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1 in data sets with low corruption rates. Increased corruption rates led to decreased sensitivity for all methods.
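For reference, sensitivity and PPV for a linkage run can be computed from the set of predicted links and the set of true links. The sketch below uses the standard definitions; the function name and the example pairs are hypothetical and are not the study's data.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def sensitivity_and_ppv(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Return (sensitivity, PPV) for predicted record pairs against true matches."""
    tp = len(predicted & truth)   # true links that were found
    fn = len(truth - predicted)   # true links that were missed
    fp = len(predicted - truth)   # predicted links that are wrong
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Hypothetical example: 4 true links, 4 predicted links, 3 in common.
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(1, 1), (2, 2), (3, 3), (5, 6)}
sens, ppv = sensitivity_and_ppv(predicted, truth)
```

Sensitivity falls when true links are missed, which is the effect the abstract reports as corruption rates rise; PPV falls when incorrect links are accepted.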

CONCLUSIONS

These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.
