Li Xiaochun, Xu Huiping, Shen Changyu, Grannis Shaun
Indiana University School of Medicine, Indianapolis, USA.
Stat Methods Med Res. 2018 Jan;27(1):172-184. doi: 10.1177/0962280215626180. Epub 2016 Jul 20.
We introduce an automated method of record linkage that has two key features: automated selection of the match field interactions to include in the estimation model, and automated determination of the threshold for classifying record pairs as matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: when data quality is adequate and the discriminating power of the match fields is high, matching algorithms exhibit similar performance. The second example demonstrated that, when data quality is low, our method yields a lower false positive rate and a higher positive predictive value than the Fellegi-Sunter model. Simulation studies suggest that, compared with the Fellegi-Sunter model, our method exhibits better overall performance, as indicated by a higher area under the curve, and less biased estimates of both the match prevalence rate and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations where an unsupervised linking algorithm is needed.
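For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of how the baseline Fellegi-Sunter model scores a record pair from m-probabilities (probability that a field agrees given a true match), u-probabilities (probability that a field agrees given a non-match), and the match prevalence. It assumes conditional independence between match fields, which is the textbook simplification and not the authors' interaction-selecting method. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def fs_weight(agreements, m, u):
    """Fellegi-Sunter match weight: sum of per-field log2 likelihood ratios.

    Assumes fields are conditionally independent given match status.
    agreements: list of booleans, one per match field.
    m, u: lists of per-field m- and u-probabilities (hypothetical values).
    """
    total = 0.0
    for agree, m_k, u_k in zip(agreements, m, u):
        if agree:
            total += math.log2(m_k / u_k)
        else:
            total += math.log2((1 - m_k) / (1 - u_k))
    return total

def match_posterior(agreements, m, u, prevalence):
    """Posterior probability that a pair is a true match (Bayes' rule)."""
    like_match = 1.0
    like_nonmatch = 1.0
    for agree, m_k, u_k in zip(agreements, m, u):
        like_match *= m_k if agree else 1 - m_k
        like_nonmatch *= u_k if agree else 1 - u_k
    numerator = prevalence * like_match
    return numerator / (numerator + (1 - prevalence) * like_nonmatch)

# Illustrative values only: two match fields, both agreeing.
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]
weight = fs_weight([True, True], m_probs, u_probs)
posterior = match_posterior([True, True], m_probs, u_probs, prevalence=0.5)
```

A pair is classified as a match when its weight (or posterior) exceeds a threshold. In the classical model that threshold is typically chosen by the analyst, whereas the abstract describes determining it automatically.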