Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA, United States.
HIV/STD Program, Public Health-Seattle and King County, Seattle, WA, United States.
JMIR Public Health Surveill. 2020 Apr 30;6(2):e15917. doi: 10.2196/15917.
Many public health departments link surveillance data to external data sources to inform public health interventions. However, little guidance is available for these activities, and many health departments rely on deterministic algorithms that may miss many true matches. In the context of public health action, these missed matches become missed opportunities to deliver interventions and may exacerbate existing health inequities.
This study aimed to compare the performance of record linkage algorithms commonly used in public health practice.
We compared five deterministic (exact, Stenger, Ocampo 1, Ocampo 2, and Bosh) and two probabilistic record linkage algorithms (fastLink and beta record linkage [BRL]) using simulations and a real-world scenario. We simulated pairs of datasets that varied in the number of errors per record and in the number of matching records between the two datasets (ie, overlap). We matched the datasets using each algorithm and calculated its recall (ie, sensitivity, the proportion of true matches identified by the algorithm) and precision (ie, positive predictive value, the proportion of matches identified by the algorithm that were true matches). We estimated the average computation time by performing a match with each algorithm 20 times while varying the size of the datasets being matched. In a real-world scenario, HIV and sexually transmitted disease surveillance data from King County, Washington, were matched to identify people living with HIV who had a syphilis diagnosis in 2017. We calculated the recall and precision of each algorithm compared with a composite standard based on the agreement in matching decisions across all the algorithms and manual review.
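As a minimal sketch of the two metrics defined above, recall and precision can be computed from the set of record pairs an algorithm declares as matches and the set of true matches. The function name `recall_precision` and the example record-ID pairs are hypothetical and are not taken from the study.

```python
def recall_precision(predicted, truth):
    """Return (recall, precision) for a set of predicted match pairs.

    recall    = true matches identified / all true matches
    precision = true matches identified / all matches identified
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    recall = true_positives / len(truth) if truth else float("nan")
    precision = true_positives / len(predicted) if predicted else float("nan")
    return recall, precision


# Hypothetical (dataset A id, dataset B id) pairs, for illustration only.
truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
predicted = {(1, "a"), (2, "b"), (3, "x")}

recall, precision = recall_precision(predicted, truth)
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

In this toy case the algorithm finds 2 of the 4 true matches (recall 0.5) and 2 of its 3 declared matches are true (precision about 0.667).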
In simulations, BRL and fastLink maintained a high recall at nearly all data quality levels, while being comparable with deterministic algorithms in terms of precision. Deterministic algorithms typically failed to identify matches in scenarios with low data quality. All the deterministic algorithms had a shorter average computation time than the probabilistic algorithms. BRL had the longest overall computation time (14 min when both datasets contained 2000 records). In the real-world scenario, BRL showed the smallest trade-off between recall (309/309, 100.0%) and precision (309/312, 99.0%).
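The following sketch illustrates why an exact deterministic rule can miss a true match when a record contains a single typographical error, whereas a rule that scores field similarity can still accept it. It is an illustration under stated assumptions only: the similarity rule uses Python's `difflib.SequenceMatcher` with an arbitrary 0.9 threshold and is not fastLink, BRL, or any of the deterministic algorithms evaluated in the study. The field names and records are hypothetical.

```python
from difflib import SequenceMatcher

FIELDS = ("first", "last", "dob")  # hypothetical matching fields


def exact_match(a, b):
    """Deterministic rule: every field must agree exactly."""
    return all(a[f] == b[f] for f in FIELDS)


def similarity_score(a, b):
    """Mean string similarity across fields, in [0, 1] (illustrative only)."""
    ratios = [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]
    return sum(ratios) / len(ratios)


# Same hypothetical person, with one typo in the last name of the second record.
rec_a = {"first": "Maria", "last": "Gonzalez", "dob": "1985-03-12"}
rec_b = {"first": "Maria", "last": "Gonzales", "dob": "1985-03-12"}

is_exact = exact_match(rec_a, rec_b)   # the exact rule rejects the pair
score = similarity_score(rec_a, rec_b)
is_scored_match = score >= 0.9         # threshold is an arbitrary assumption

print(is_exact, round(score, 3), is_scored_match)
```

A fixed threshold like this is a simplification chosen to keep the example short; it should not be read as a description of how the probabilistic algorithms in the study decide on matches.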
Probabilistic record linkage algorithms maximize the number of true matches identified, reducing gaps in the coverage of interventions and maximizing the reach of public health action.