Harvey Goldstein, Katie Harron, Mario Cortina-Borja
University of Bristol, Bristol, U.K.
University College London, London, U.K.
Stat Med. 2017 Jul 20;36(16):2514-2521. doi: 10.1002/sim.7287. Epub 2017 Mar 16.
As large datasets derived from administrative and other sources become more widely available, demand is growing for linking them successfully to provide rich sources of data for further analysis. Because the quality of the identifiers used for linkage varies, existing approaches are often based on 'probabilistic' models, which rest on a number of assumptions and can make heavy computational demands. In this paper, we suggest a new approach to classifying record pairs in linkage, based on weights (scores) derived using a scaling algorithm. The proposed method does not rely on training data, is computationally fast, requires only moderate amounts of storage and has intuitive appeal. Copyright © 2017 John Wiley & Sons, Ltd.
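The abstract describes classifying record pairs by weights (scores) attached to identifier agreement. As a rough illustration of that general idea only, the Python sketch below sums a weight per identifier and compares the total with a cut-off. It does not reproduce the paper's scaling algorithm: the field names, the weight values in `WEIGHTS`, the three agreement states and the threshold are all hypothetical placeholders chosen for the example.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical per-identifier weights for each agreement state.
# Placeholder numbers for illustration only; in the paper the weights
# come from a scaling algorithm, which is NOT implemented here.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "surname": {"agree": 2.0, "disagree": -1.5, "missing": 0.0},
    "birth_date": {"agree": 3.0, "disagree": -2.0, "missing": 0.0},
    "postcode": {"agree": 1.0, "disagree": -0.5, "missing": 0.0},
}


def compare(a: Record, b: Record, field: str) -> str:
    """Return the agreement state of one identifier for a record pair."""
    va, vb = a.get(field), b.get(field)
    if va is None or vb is None:
        return "missing"
    return "agree" if va == vb else "disagree"


def pair_score(a: Record, b: Record) -> float:
    """Sum the weights of the observed agreement states over all identifiers."""
    return sum(WEIGHTS[field][compare(a, b, field)] for field in WEIGHTS)


def classify(a: Record, b: Record, threshold: float = 3.0) -> str:
    """Label a record pair as a link when its score reaches the (assumed) threshold."""
    return "link" if pair_score(a, b) >= threshold else "non-link"


if __name__ == "__main__":
    rec_a = {"surname": "Smith", "birth_date": "1980-01-01", "postcode": "BS1"}
    rec_b = {"surname": "Smith", "birth_date": "1980-01-01", "postcode": None}
    rec_c = {"surname": "Jones", "birth_date": "1980-01-01", "postcode": "WC1"}
    print(pair_score(rec_a, rec_b), classify(rec_a, rec_b))
    print(pair_score(rec_a, rec_c), classify(rec_a, rec_c))
```

In this sketch a missing identifier contributes nothing to the score, which is one simple convention among several; how the weights are actually estimated without training data is the subject of the paper itself.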