
Utilising identifier error variation in linkage of large administrative data sources.

Authors

Harron Katie, Hagger-Johnson Gareth, Gilbert Ruth, Goldstein Harvey

Affiliations

London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, London, WC1H 9SH, UK.

Administrative Data Research Centre for England, UCL, 222 Euston Road, London, NW1 2DA, UK.

Publication information

BMC Med Res Methodol. 2017 Feb 7;17(1):23. doi: 10.1186/s12874-017-0306-8.

DOI: 10.1186/s12874-017-0306-8
PMID: 28173759
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5297137/
Abstract

BACKGROUND

Linkage of administrative data sources often relies on probabilistic methods using a set of common identifiers (e.g. sex, date of birth, postcode). Variation in data quality on an individual or organisational level (e.g. by hospital) can result in clustering of identifier errors, violating the assumption of independence between identifiers required for traditional probabilistic match weight estimation. This potentially introduces selection bias to the resulting linked dataset. We aimed to measure variation in identifier error rates in a large English administrative data source (Hospital Episode Statistics; HES) and to incorporate this information into match weight calculation.
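For readers unfamiliar with traditional probabilistic match weights, the following is a minimal sketch of the standard Fellegi-Sunter style calculation that the independence assumption makes possible. The function names and the m- and u-probability values are illustrative assumptions, not figures from the paper.

```python
import math

def match_weight(m, u, agrees):
    """Log2 weight for one identifier.

    m: P(identifier agrees | records belong to the same person)
    u: P(identifier agrees | records belong to different people)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def total_weight(m_probs, u_probs, agreement):
    # Under the independence assumption the per-identifier
    # weights are simply summed to score a candidate record pair.
    return sum(
        match_weight(m_probs[name], u_probs[name], agrees)
        for name, agrees in agreement.items()
    )

# Hypothetical probabilities, chosen only for illustration.
m_probs = {"sex": 0.99, "dob": 0.98, "postcode": 0.90}
u_probs = {"sex": 0.50, "dob": 0.01, "postcode": 0.001}

pair = {"sex": True, "dob": True, "postcode": False}
print(round(total_weight(m_probs, u_probs, pair), 2))
```

If errors in two identifiers tend to occur together (for example, within one hospital), summing the weights in this way treats correlated evidence as independent, which is the problem the abstract describes.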

METHODS

We used 30,000 randomly selected HES hospital admissions records of patients aged 0-1, 5-6 and 18-19 years, for 2011/2012, linked via NHS number with data from the Personal Demographic Service (PDS; our gold-standard). We calculated identifier error rates for sex, date of birth and postcode and used multi-level logistic regression to investigate associations with individual-level attributes (age, ethnicity, and gender) and organisational variation. We then derived: i) weights incorporating dependence between identifiers; ii) attribute-specific weights (varying by age, ethnicity and gender); and iii) organisation-specific weights (by hospital). Results were compared with traditional match weights using a simulation study.
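One common way to incorporate dependence between identifiers is to score the joint agreement pattern rather than summing per-identifier weights. The sketch below illustrates that idea under stated assumptions: the function name, the additive smoothing constant, and the toy data are hypothetical, and this is not necessarily the estimator the authors used.

```python
import math
from collections import Counter

def pattern_weights(match_patterns, nonmatch_patterns, smoothing=0.5):
    """Log2 likelihood ratio for each joint agreement pattern.

    Each pattern is a tuple of 0/1 flags, e.g. (sex, dob, postcode).
    Additive smoothing (an assumption of this sketch) avoids
    division by zero for patterns unseen in one of the two sets.
    """
    patterns = set(match_patterns) | set(nonmatch_patterns)
    m_counts = Counter(match_patterns)
    u_counts = Counter(nonmatch_patterns)
    k = len(patterns)
    weights = {}
    for p in patterns:
        m = (m_counts[p] + smoothing) / (len(match_patterns) + smoothing * k)
        u = (u_counts[p] + smoothing) / (len(nonmatch_patterns) + smoothing * k)
        weights[p] = math.log2(m / u)
    return weights

# Toy data: patterns observed among known matches and known non-matches.
matches = [(1, 1, 1)] * 8 + [(1, 1, 0)] * 2
nonmatches = [(0, 0, 0)] * 7 + [(1, 0, 0)] * 3

weights = pattern_weights(matches, nonmatches)
for pattern in sorted(weights, key=weights.get, reverse=True):
    print(pattern, round(weights[pattern], 2))
```

Because each pattern is weighted as a whole, correlated errors across identifiers are reflected directly in the estimated probabilities instead of being assumed away.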

RESULTS

Identifier errors (where values disagreed in linked HES-PDS records) or missing values were found in 0.11% of records for sex and date of birth and in 53% of records for postcode. Identifier error rates differed significantly by age, ethnicity and sex (p < 0.0005). Errors were less frequent in males, in 5-6 year olds and 18-19 year olds compared with infants, and were lowest for the Asian ethnic group. A simulation study demonstrated that substantial bias was introduced into estimated readmission rates in the presence of identifier errors. Attribute- and organisational-specific weights reduced this bias compared with weights estimated using traditional probabilistic matching algorithms.
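The idea of attribute-specific weights can be sketched as follows: estimate the agreement probability among gold-standard matched pairs separately for each subgroup, then convert it to a weight. The record layout, function names, group labels, and the u-probability below are hypothetical choices for illustration, not values or code from the study.

```python
import math
from collections import defaultdict

def group_m_probs(linked_pairs, identifier, group_key):
    """Share of gold-standard matched pairs that agree on `identifier`,
    computed separately for each value of `group_key`."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for pair in linked_pairs:
        group = pair[group_key]
        total[group] += 1
        agree[group] += int(pair["agree"][identifier])
    return {group: agree[group] / total[group] for group in total}

def agreement_weight(m, u):
    # Log2 weight awarded when the identifier agrees.
    return math.log2(m / u)

# Toy gold-standard pairs; every value here is made up.
pairs = (
    [{"age_group": "0-1", "agree": {"postcode": True}}] * 3
    + [{"age_group": "0-1", "agree": {"postcode": False}}]
    + [{"age_group": "18-19", "agree": {"postcode": True}}] * 4
)

m_by_age = group_m_probs(pairs, "postcode", "age_group")
u_postcode = 0.001  # assumed chance agreement among non-matches
for group, m in m_by_age.items():
    print(group, m, round(agreement_weight(m, u_postcode), 2))
```

In this toy setup a subgroup with more identifier errors receives a lower agreement weight, so the scoring no longer assumes a single error rate for all records. Organisation-specific weights would follow the same pattern with a hospital identifier as the grouping key.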

CONCLUSIONS

We provide empirical evidence on variation in rates of identifier error in a widely-used administrative data source and propose a new method for deriving match weights that incorporates additional data attributes. Our results demonstrate that incorporating information on variation by individual-level characteristics can help to reduce bias due to linkage error.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ace2/5297137/ef3f34f5a3dc/12874_2017_306_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ace2/5297137/78012df2c6cc/12874_2017_306_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ace2/5297137/f4c2aca828bb/12874_2017_306_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ace2/5297137/2002c6f23e19/12874_2017_306_Fig4_HTML.jpg
