Missing values in deduplication of electronic patient data.

Affiliation

Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University, Mainz, Germany.

Publication information

J Am Med Inform Assoc. 2012 Jun;19(e1):e76-82. doi: 10.1136/amiajnl-2011-000461. Epub 2011 Oct 15.

DOI:10.1136/amiajnl-2011-000461
PMID:22003173
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3392851/
Abstract

INTRODUCTION

Systematic approaches to dealing with missing values in record linkage are still lacking. This article compares the ad-hoc treatment of unknown comparison values as 'unequal' with other, more sophisticated approaches. The methods were evaluated empirically on real-world data and on simulated data derived from them.

MATERIAL AND METHODS

Cancer registry data and artificial data with increased numbers of missing values in a relevant variable are used for empirical comparisons. As a classification method, classification and regression trees were used. On the resulting binary comparison patterns, the following strategies for dealing with missingness are considered: imputation with unique values, sample-based imputation, reduced-model classification and complete-case induction. These approaches are evaluated according to the number of training data needed for induction and the F-scores achieved.
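As a rough illustration of what a comparison pattern with missing values and unique-value imputation might look like, here is a minimal Python sketch. The records, field order, and the helper names `compare`, `comparison_pattern`, and `impute_unique` are hypothetical and are not taken from the article; the sketch only assumes that a field comparison yields 1 for equal, 0 for unequal, and a missing marker when either value is unknown.

```python
from typing import List, Optional, Sequence, Union

Number = Union[int, float]


def compare(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Return 1 if both values are equal, 0 if unequal, None if either is missing."""
    if a is None or b is None:
        return None
    return int(a == b)


def comparison_pattern(rec_a: Sequence[Optional[str]],
                       rec_b: Sequence[Optional[str]]) -> List[Optional[int]]:
    """Compare two records field by field."""
    return [compare(x, y) for x, y in zip(rec_a, rec_b)]


def impute_unique(pattern: Sequence[Optional[int]], fill: Number) -> List[Number]:
    """Replace every missing comparison value with one fixed value."""
    return [fill if v is None else v for v in pattern]


# Hypothetical records: surname, first name, date of birth, postal code.
rec_a = ("MEYER", "ANNA", "1950-03-12", None)
rec_b = ("MEYER", "ANNA", "1950-03-12", "55116")

pattern = comparison_pattern(rec_a, rec_b)   # last field is missing
as_unequal = impute_unique(pattern, 0)       # missing treated as 'unequal'
as_half = impute_unique(pattern, 0.5)        # missing gets an intermediate value

print(pattern)
print(as_unequal)
print(as_half)
```

Filling with 0 keeps every pattern strictly binary, whereas filling with 0.5 introduces a third value that a classifier has to handle separately.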

RESULTS

The evaluations reveal that unique value imputation leads to the best results. Imputation with zero is preferred to imputation with 0.5, although the latter shows the highest median F-scores. Imputation with zero needs considerably less training data, shows only slightly worse results, and simplifies the computation by maintaining the binary structure of the data.
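For readers unfamiliar with the metric, the F-score used to compare strategies can be sketched as follows. This assumes the balanced variant (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the counts in the example are invented for illustration only.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Returns 0.0 when no true positives exist, to avoid division by zero.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 90 duplicate pairs found, 10 false alarms, 30 duplicates missed.
example = f_score(true_pos=90, false_pos=10, false_neg=30)
print(round(example, 3))
```

Because true negatives do not enter the formula, the score is not inflated by the very large number of non-duplicate record pairs that is typical in deduplication.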

CONCLUSIONS

The results support the ad-hoc solution for missing values 'replace NA by the value of inequality'. This conclusion is based on a limited amount of data and on a specific deduplication method. Nevertheless, the authors are confident that their results should be confirmed by other empirical analyses and applications.
