Ong Toan C, Mannino Michael V, Schilling Lisa M, Kahn Michael G
University of Colorado, Denver, Business School, Denver, CO, USA; Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
University of Colorado, Denver, Business School, Denver, CO, USA.
J Biomed Inform. 2014 Dec;52:43-54. doi: 10.1016/j.jbi.2014.01.016. Epub 2014 Feb 10.
Existing record linkage methods do not handle missing values in linkage fields efficiently or effectively. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when linkage fields have missing values.
By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to address the missing data problem in record linkage, which we refer to as Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of the missing attribute proportionally across the remaining available linkage fields. Distance Imputation imputes the comparison distance for fields with missing data rather than imputing the missing value itself. Linkage Expansion adds fields previously treated as non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.
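To make the Weight Redistribution idea concrete, the following Python sketch scores a record pair by dropping any linkage field that is missing in either record and spreading its weight proportionally over the remaining fields. This is a minimal illustration under assumed inputs (a normalized weight per field and a per-field similarity function returning a value in [0, 1]); the function and field names are hypothetical and do not reproduce FRIL's Java API or the paper's exact scoring functions.

```python
# Illustrative sketch of Weight Redistribution (not FRIL code).
def weight_redistribution_score(record_a, record_b, weights, comparators):
    """Fellegi-Sunter-style score where fields missing in either record are
    dropped and their weight is redistributed proportionally over the rest."""
    usable = [f for f in weights
              if record_a.get(f) is not None and record_b.get(f) is not None]
    if not usable:
        return 0.0                      # no usable linkage fields
    total = sum(weights[f] for f in usable)
    score = 0.0
    for f in usable:
        share = weights[f] / total      # proportional share, absorbing the missing fields' weight
        score += share * comparators[f](record_a[f], record_b[f])
    return score


# Example: 'last_name' is missing in one record, so its weight is redistributed.
exact = lambda a, b: 1.0 if a == b else 0.0
weights = {"first_name": 0.3, "last_name": 0.4, "dob": 0.3}
comparators = {"first_name": exact, "last_name": exact, "dob": exact}
a = {"first_name": "ANNA", "last_name": None, "dob": "1980-01-02"}
b = {"first_name": "ANNA", "last_name": "SMITH", "dob": "1980-01-02"}
print(weight_redistribution_score(a, b, weights, comparators))  # -> 1.0
```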
The methods developed had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1 in data sets with low corruption rates. Increased corruption rates led to decreased sensitivity for all methods.
These new record linkage algorithms show promise in both accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.