Lam Joseph, Boyd Andy, Linacre Robin, Blackburn Ruth, Harron Katie
Population, Policy & Practice Research and Teaching Department, UCL Great Ormond Street Institute of Child Health, London, United Kingdom.
Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom.
Int J Popul Data Sci. 2024 Jul 1;9(1):2389. doi: 10.23889/ijpds.v9i1.2389. eCollection 2024.
Careful development and evaluation of data linkage methods is limited by restricted researcher access to personal identifiers. One solution is to generate synthetic identifiers, which do not pose equivalent privacy concerns but can form a 'gold-standard' dataset for training linkage algorithms. Such data could help inform choices about appropriate linkage strategies in different settings.
We aimed to develop and demonstrate a framework for generating synthetic identifier datasets to support development and evaluation of data linkage methods. We evaluated whether replicating associations between attributes and identifiers improved the utility of the synthetic data for assessing linkage error.
We determined the steps required to generate synthetic identifiers that replicate the properties of real-world data collection. We then generated synthetic versions of a large UK cohort study (the Avon Longitudinal Study of Parents and Children; ALSPAC), according to the quality and completeness of identifiers recorded over several waves of the cohort. We evaluated the utility of the synthetic identifier data in terms of assessing linkage quality (false matches and missed matches).
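As a rough illustration of the idea (not the authors' implementation), the sketch below generates a clean first-wave record for each person and a second-wave copy in which identifier disagreements are injected with a probability that depends on an attribute. The name lists, the maternal age bands, the probabilities, and all function names are hypothetical and are not taken from ALSPAC.

```python
import random

# Illustrative values only: none of these probabilities or name lists come from ALSPAC.
SURNAME_CHANGE_PROB = {"<25": 0.30, "25-34": 0.15, "35+": 0.08}
FORENAME_TYPO_PROB = 0.05
SURNAMES = ["Smith", "Jones", "Taylor", "Brown", "Williams", "Wilson"]
FORENAMES = ["Emma", "Sarah", "Claire", "Helen", "Rachel", "Louise"]


def make_person(rng, pid):
    """Draw a clean first-wave record with an attribute (maternal age group)."""
    return {
        "id": pid,
        "age_group": rng.choice(list(SURNAME_CHANGE_PROB)),
        "forename": rng.choice(FORENAMES),
        "surname": rng.choice(SURNAMES),
    }


def add_typo(rng, name):
    """Delete one character at a random position to mimic a recording error."""
    i = rng.randrange(len(name))
    return name[:i] + name[i + 1:]


def second_wave(rng, person):
    """Copy a record, then inject attribute-dependent disagreements."""
    record = dict(person)
    if rng.random() < SURNAME_CHANGE_PROB[person["age_group"]]:
        # Natural change: draw a surname different from the first-wave one.
        others = [s for s in SURNAMES if s != person["surname"]]
        record["surname"] = rng.choice(others)
    if rng.random() < FORENAME_TYPO_PROB:
        record["forename"] = add_typo(rng, person["forename"])
    return record


def generate(n, seed=0):
    """Return two lists of records; the shared 'id' field is the true match key."""
    rng = random.Random(seed)
    wave1 = [make_person(rng, i) for i in range(n)]
    wave2 = [second_wave(rng, p) for p in wave1]
    return wave1, wave2
```

Because both waves keep the true `id`, the pair of datasets can serve as a gold standard: a linkage algorithm is run on the identifiers alone and its links are then checked against `id`.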
Comparing data from two collection points in ALSPAC, we found within-person disagreement in identifiers (differences in recording due to both natural change and non-valid entries) in 18% of surnames and 12% of forenames. Rates of disagreement varied by maternal age and ethnic group. Synthetic data provided accurate estimates of linkage quality metrics compared with the original data (within 0.13-0.55% for missed matches and 0.00-0.04% for false matches). Incorporating associations between identifier errors and maternal age/ethnicity improved synthetic data utility.
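The two linkage quality metrics can be computed along the following lines when the true match status is known. This is a minimal sketch under assumed definitions: the missed-match rate is taken as the share of true matches that were not linked, and the false-match rate as the share of links that are not true matches. The abstract does not state the denominators used, and the function name is hypothetical.

```python
def linkage_error(true_pairs, linked_pairs):
    """Return (missed_match_rate, false_match_rate) for a set of links.

    Assumed definitions (not confirmed by the source):
      missed-match rate = true matches not linked / all true matches
      false-match rate  = links that are not true matches / all links
    """
    true_pairs, linked_pairs = set(true_pairs), set(linked_pairs)
    missed = true_pairs - linked_pairs
    false = linked_pairs - true_pairs
    missed_rate = len(missed) / len(true_pairs) if true_pairs else 0.0
    false_rate = len(false) / len(linked_pairs) if linked_pairs else 0.0
    return missed_rate, false_rate


# Toy example: pairs are (record id in file A, record id in file B).
true_pairs = {(1, 1), (2, 2), (3, 3), (4, 4)}
linked_pairs = {(1, 1), (2, 2), (3, 5)}
# 2 of 4 true matches are missed; 1 of 3 links is false.
missed_rate, false_rate = linkage_error(true_pairs, linked_pairs)
```

Running the same function on links produced from the original data and from the synthetic data gives two pairs of rates whose difference indicates how well the synthetic data reproduces linkage error.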
We show that replicating dependencies between attribute values (e.g. ethnicity), values of identifiers (e.g. name), identifier disagreements (e.g. missing values, errors or changes over time), and their patterns and distribution structure enables generation of realistic synthetic data that can be used for robust evaluation of linkage methods.