Mamun Abdullah-Al, Aseltine Robert, Rajasekaran Sanguthevar
Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, United States of America.
Institute for Public Health Research, University of Connecticut, East Hartford, Connecticut, United States of America.
PLoS One. 2016 Apr 28;11(4):e0154446. doi: 10.1371/journal.pone.0154446. eCollection 2016.
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times.
来自不同机构的数据共享相同个体的数据。将这些数据集链接起来以识别属于同一人的所有记录是一个关键且具有挑战性的问题,尤其是考虑到数据量巨大。大量现有的记录链接算法在查找记录之间的匹配项和不匹配项时,要么效率低下,要么准确率低。在本文中,我们提出了使用层次聚类方法解决记录链接问题的高效且可靠的顺序和并行算法。我们采用完全链接层次聚类算法来解决这个问题。除了层次聚类,我们还使用另外两种技术:消除重复记录和分块。我们的算法使用排序作为子例程来识别记录的相同副本。我们在包含数百万条合成记录的数据集上测试了我们的算法。实验结果表明,我们的算法准确率接近100%。并行实现几乎实现了线性加速。这些算法的时间复杂度不超过之前最著名算法的时间复杂度。我们提出的算法在准确率方面优于之前最著名的算法,且运行时间合理。