Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA.
J Am Med Inform Assoc. 2014 Mar-Apr;21(2):252-62. doi: 10.1136/amiajnl-2013-002034. Epub 2013 Oct 23.
Integrating data from multiple sources is a crucial and challenging problem. Although numerous algorithms exist for record linkage or deduplication, they either require excessive run time or restrict the number of datasets they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage that handle any number of datasets and outperform previous algorithms.
Our algorithms are based on hierarchical clustering. One key idea is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph linking similar records and then find its connected components.
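The two ideas above can be sketched roughly as follows. This is a minimal Python illustration, not the paper's implementation: the record dictionaries, the key field, the all-pairs comparison, and the edit-distance threshold are hypothetical choices made for brevity; connected components are found here with a simple union-find.

```python
from collections import defaultdict
from itertools import groupby

def radix_sort(keys):
    """LSD radix sort on strings: pad to a common width, then run one
    stable 256-bucket pass per character position, right to left."""
    width = max((len(k) for k in keys), default=0)
    padded = [k.ljust(width, "\0") for k in keys]  # NUL-pad shorter keys
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(256)]         # assumes 8-bit characters
        for k in padded:
            buckets[ord(k[pos])].append(k)
        padded = [k for b in buckets for k in b]
    return [k.rstrip("\0") for k in padded]

def dedup_exact(records, field):
    """Radix-sort the key attribute so identical values become adjacent,
    then keep one representative record per distinct key."""
    first_with_key = {}
    for r in records:
        first_with_key.setdefault(r[field], r)
    sorted_keys = radix_sort([r[field] for r in records])
    return [first_with_key[k] for k, _ in groupby(sorted_keys)]

def edit_distance(a, b):
    """Standard Levenshtein distance with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def link_records(records, field, similar):
    """Add a graph edge between each pair of similar records and return
    the connected components; each component is one linked entity."""
    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    # All-pairs comparison for brevity only; efficient record-linkage
    # algorithms avoid comparing every pair of records.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similar(records[i][field], records[j][field]):
                parent[find(i)] = find(j)
    components = defaultdict(list)
    for i, r in enumerate(records):
        components[find(i)].append(r)
    return list(components.values())
```

For example, with records `[{"name": "jon"}, {"name": "john"}, {"name": "mary"}]` and a similarity test of edit distance at most 1, `link_records` returns two components, one containing the two similar names. Union-find is used here only as a compact way to compute connected components; any graph traversal would do.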
Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and on synthetic datasets ranging in size from 50,000 to 9,000,000 records. On every dataset, our sequential algorithm runs at least twice as fast as the previous best-known algorithm, the two-phase algorithm with faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear: for example, 7.5 with 8 cores (on a single node), 14.1 with 16 cores (on two nodes), and 26.4 with 32 cores (on four nodes).
We compared the performance of our sequential algorithm with TPA (FCED) and found that ours outperforms it while achieving the same accuracy as this previous best-known algorithm.