NIHR Biomedical Research Centre, John Radcliffe Hospital, Oxford, UK.
BMC Med Inform Decis Mak. 2011 Feb 1;11:7. doi: 10.1186/1472-6947-11-7.
Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. Very large numbers of records must now routinely be linked, and these often contain various combinations of theoretically unique identifiers, such as NHS numbers, that are both incomplete and error-prone.
We describe a two-step record linkage algorithm. In the first step, identifiers with high cardinality are identified or generated and used to perform an initial linkage based on exact matching. In the second step, the resulting clusters are examined and, where appropriate, partitioned using a graph-based algorithm that detects erroneous identifiers.
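The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the graph algorithm, so step 2 below substitutes a simple heuristic, flagging an identifier value when removing it splits an internally conflicting cluster into parts that are each consistent. The function names (`link_exact`, `is_consistent`, `suspect_identifiers`) and the field names (`nhs`, `mrn`, `pas`) are hypothetical, and the sketch assumes each field should hold a single value per patient.

```python
from collections import defaultdict


def link_exact(records):
    """Step 1: cluster records sharing any exact (field, value) identifier."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for key in rec.items():
            if key[1] is None:
                continue  # a missing identifier never links records
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


def is_consistent(members):
    """True if no field takes two different non-missing values (assumption)."""
    seen = {}
    for rec in members:
        for field, value in rec.items():
            if value is not None and seen.setdefault(field, value) != value:
                return False
    return True


def suspect_identifiers(records, cluster):
    """Step 2 (stand-in heuristic, not the published algorithm)."""
    members = [records[i] for i in cluster]
    if is_consistent(members):
        return []  # nothing suggests the cluster mixes patients
    keys = dict.fromkeys(
        kv for rec in members for kv in rec.items() if kv[1] is not None
    )
    suspects = []
    for key in keys:
        # Drop one identifier value and re-link the cluster without it.
        pruned = [
            {f: v for f, v in rec.items() if (f, v) != key} for rec in members
        ]
        parts = link_exact(pruned)
        if len(parts) > 1 and all(
            is_consistent([pruned[i] for i in part]) for part in parts
        ):
            suspects.append(key)
    return suspects


if __name__ == "__main__":
    sample = [
        {"nhs": "111", "mrn": "A1", "pas": None},
        {"nhs": "111", "mrn": None, "pas": None},
        {"nhs": "222", "mrn": "B2", "pas": "P2"},
        {"nhs": "222", "mrn": "A1", "pas": "P2"},  # mrn entered in error
        {"nhs": "333", "mrn": "C3", "pas": None},
    ]
    for cluster in link_exact(sample):
        print(cluster, suspect_identifiers(sample, cluster))
```

In the sample data, the mistyped `mrn` value `A1` on the fourth record merges two patients into one cluster during step 1, and the heuristic singles out that value as the candidate for partitioning. Union-find keeps step 1 close to linear in the number of identifier values, which is in the spirit of the fast exact-match pass described above. The brute-force re-linking in step 2 is only practical because it is applied to one small cluster at a time.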
The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters, about 99.8% of which contain, with high likelihood, records from a single patient. Although the algorithm is computationally efficient, a cluster forms only when at least one identifier of a record exactly matches an identifier of another record; this requirement may be a limitation in databases containing records of low identifier quality.
The technique described offers a simple, fast and highly efficient two-step method for large-scale initial linkage of records commonly found in the UK's National Health Service.