An efficient record linkage scheme using graphical analysis for identifier error detection.

Affiliation

NIHR Biomedical Research Centre, John Radcliffe Hospital, Oxford, UK.

Publication information

BMC Med Inform Decis Mak. 2011 Feb 1;11:7. doi: 10.1186/1472-6947-11-7.

Abstract

BACKGROUND

Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.

METHODS

We describe a two-step record linkage algorithm. In the first step, high-cardinality identifiers are identified or generated and used to perform an initial linkage based on exact matching. In the second step, the resulting clusters are examined and, where appropriate, partitioned using a graph-based algorithm that detects erroneous identifiers.
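The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the graph algorithm, so step 2 here substitutes a simple stand-in heuristic that flags any single identifier value whose removal disconnects a cluster. The field names (`nhs`, `mrn`, `key`), the sample records, and the function names `link_exact` and `find_cut_identifiers` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def link_exact(records):
    """Step 1 (sketch): cluster records sharing any identifier value exactly."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of first record holding it
    for idx, rec in enumerate(records):
        for field, value in rec.items():
            if value is None:  # missing identifiers never link records
                continue
            key = (field, value)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


def find_cut_identifiers(records, cluster):
    """Step 2 (assumed heuristic): identifier values that alone hold a cluster together."""
    counts = Counter(
        (f, v) for i in cluster for f, v in records[i].items() if v is not None
    )
    suspects = []
    for cand in sorted(kv for kv, n in counts.items() if n >= 2):
        # Re-link the cluster with the candidate identifier value blanked out.
        sub = [{f: v for f, v in records[i].items() if (f, v) != cand}
               for i in cluster]
        parts = link_exact(sub)
        if len(parts) > 1:
            # Map positions in `sub` back to the original record indices.
            suspects.append((cand, [[cluster[j] for j in p] for p in parts]))
    return suspects


# Hypothetical records: record 3 carries a mistyped "mrn" that merges two patients.
records = [
    {"nhs": "111", "mrn": "A1", "key": "smith|1970"},
    {"nhs": "111", "mrn": "A1", "key": "smith|1970"},
    {"nhs": "222", "mrn": "B1", "key": "jones|1982"},
    {"nhs": "222", "mrn": "A1", "key": "jones|1982"},
    {"nhs": "333", "mrn": "C1", "key": "patel|1990"},
]
clusters = link_exact(records)
suspects = find_cut_identifiers(records, clusters[0])
```

In this toy data the exact-match step merges records 0 to 3 into one cluster, and the heuristic flags the `mrn` value `A1` because removing it separates the cluster into records 0 and 1 versus records 2 and 3. A heuristic of this kind only works when genuine patient clusters are held together by more than one identifier; a production system would need additional evidence before partitioning.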

RESULTS

The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters of which about 99.8% contain, with high likelihood, records from one patient. Although computationally efficient, the algorithm's requirement for exact matching of at least one identifier of each record to another for cluster formation may be a limitation in some databases containing records of low identifier quality.

CONCLUSIONS

The technique described offers a simple, fast and highly efficient two-step method for large-scale initial linkage of records commonly found in the UK's National Health Service.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2126/3039555/60a2cc0f2cc7/1472-6947-11-7-1.jpg
