Generating synthetic identifiers to support development and evaluation of data linkage methods.

Author information

Lam Joseph, Boyd Andy, Linacre Robin, Blackburn Ruth, Harron Katie

Affiliations

Population, Policy & Practice Research and Teaching Department, UCL Great Ormond Street Institute of Child Health, London, United Kingdom.

Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom.

Publication information

Int J Popul Data Sci. 2024 Jul 1;9(1):2389. doi: 10.23889/ijpds.v9i1.2389. eCollection 2024.

DOI:10.23889/ijpds.v9i1.2389
PMID:39620124
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11606631/

Abstract

INTRODUCTION

Careful development and evaluation of data linkage methods is limited by researcher access to personal identifiers. One solution is to generate synthetic identifiers, which do not pose equivalent privacy concerns, but can form a 'gold-standard' linkage algorithm training dataset. Such data could help inform choices about appropriate linkage strategies in different settings.

OBJECTIVES

We aimed to develop and demonstrate a framework for generating synthetic identifier datasets to support development and evaluation of data linkage methods. We evaluated whether replicating associations between attributes and identifiers improved the utility of the synthetic data for assessing linkage error.

METHODS

We determined the steps required to generate synthetic identifiers that replicate the properties of real-world data collection. We then generated synthetic versions of a large UK cohort study (the Avon Longitudinal Study of Parents and Children; ALSPAC), according to the quality and completeness of identifiers recorded over several waves of the cohort. We evaluated the utility of the synthetic identifier data in terms of assessing linkage quality (false matches and missed matches).
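
As a rough illustration of the idea in the Methods, the sketch below generates two "waves" of synthetic names per person and lets the probability of a surname disagreement depend on an attribute (maternal age group). It is a minimal sketch under stated assumptions, not the authors' framework: the name pools, the age groups, the disagreement rates, and the helper names `corrupt`, `generate_pairs` and `surname_disagreement_rate` are all hypothetical and are not taken from ALSPAC.

```python
import random

# All pools and rates below are hypothetical, chosen only for illustration.
SURNAMES = ["Smith", "Jones", "Taylor", "Brown", "Patel", "Khan"]
FORENAMES = ["Anna", "Maria", "Sarah", "Emma", "Aisha", "Grace"]
AGE_GROUPS = ["<25", "25-34", "35+"]
# Assumed probability that a surname differs between waves, by maternal age group.
SURNAME_DISAGREE = {"<25": 0.25, "25-34": 0.15, "35+": 0.10}
# Assumed constant probability that a forename differs between waves.
FORENAME_DISAGREE = 0.12


def corrupt(name, pool, rng):
    """Return a value that differs from `name`: a changed name, a typo, or a missing value."""
    kind = rng.choice(["change", "typo", "missing"])
    if kind == "change":
        # Natural change, e.g. a new surname: pick a different name from the pool.
        return rng.choice([n for n in pool if n != name])
    if kind == "typo":
        # Recording error: drop one character.
        i = rng.randrange(len(name))
        return name[:i] + name[i + 1:]
    return ""  # missing value


def generate_pairs(n, seed=0):
    """Generate n synthetic people, each with identifiers recorded at two waves."""
    rng = random.Random(seed)
    records = []
    for person_id in range(n):
        age_group = rng.choice(AGE_GROUPS)
        forename = rng.choice(FORENAMES)
        surname = rng.choice(SURNAMES)
        forename2, surname2 = forename, surname
        if rng.random() < SURNAME_DISAGREE[age_group]:
            surname2 = corrupt(surname, SURNAMES, rng)
        if rng.random() < FORENAME_DISAGREE:
            forename2 = corrupt(forename, FORENAMES, rng)
        records.append({
            "person_id": person_id,  # true identity, kept as the gold standard
            "age_group": age_group,
            "wave1": (forename, surname),
            "wave2": (forename2, surname2),
        })
    return records


def surname_disagreement_rate(records, age_group):
    """Share of people in an age group whose surname differs between waves."""
    group = [r for r in records if r["age_group"] == age_group]
    if not group:
        return 0.0
    return sum(r["wave1"][1] != r["wave2"][1] for r in group) / len(group)
```

Because the true `person_id` is retained, such a dataset can act as a gold standard: any linkage algorithm run on the two waves can be scored against known match status.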

RESULTS

Comparing data from two collection points in ALSPAC, we found within-person disagreement in identifiers (differences in recording due to both natural change and non-valid entries) in 18% of surnames and 12% of forenames. Rates of disagreement varied by maternal age and ethnic group. Synthetic data provided accurate estimates of linkage quality metrics compared with the original data (within 0.13-0.55% for missed matches and 0.00-0.04% for false matches). Incorporating associations between identifier errors and maternal age/ethnicity improved synthetic data utility.
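
The two linkage quality metrics named above can be computed once true match status is known, which is what a synthetic gold standard provides. The sketch below uses one common pair of definitions, which is an assumption here since the abstract does not spell out its denominators: the missed-match rate is the share of true matches that were not linked, and the false-match rate is the share of linked pairs that are not true matches. The function name `linkage_quality` and the toy record identifiers are hypothetical.

```python
def linkage_quality(true_pairs, linked_pairs):
    """Return (missed_match_rate, false_match_rate) for a set of linked record pairs.

    true_pairs   : pairs of record IDs that really belong to the same person
    linked_pairs : pairs of record IDs that the linkage algorithm joined
    """
    true_pairs = set(true_pairs)
    linked_pairs = set(linked_pairs)
    missed = true_pairs - linked_pairs   # true matches the algorithm failed to link
    false = linked_pairs - true_pairs    # links between records of different people
    missed_rate = len(missed) / len(true_pairs) if true_pairs else 0.0
    false_rate = len(false) / len(linked_pairs) if linked_pairs else 0.0
    return missed_rate, false_rate


# Toy example: four true matches, three links made, one of them wrong.
truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
links = {(1, "a"), (2, "b"), (3, "x")}
missed_rate, false_rate = linkage_quality(truth, links)
```

In the toy example two of the four true matches are missed and one of the three links is false, so the function returns a missed-match rate of 0.5 and a false-match rate of one third. Running the same calculation on original and synthetic data, and comparing the results, is one way to judge how well the synthetic data reproduce linkage quality.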

CONCLUSIONS

We show that replicating dependencies between attribute values (e.g. ethnicity), values of identifiers (e.g. name), identifier disagreements (e.g. missing values, errors or changes over time), and their patterns and distribution structure enables generation of realistic synthetic data that can be used for robust evaluation of linkage methods.
