Missing values in deduplication of electronic patient data.

Affiliation

Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University, Mainz, Germany.

Publication information

J Am Med Inform Assoc. 2012 Jun;19(e1):e76-82. doi: 10.1136/amiajnl-2011-000461. Epub 2011 Oct 15.

Abstract

INTRODUCTION

Systematic approaches to dealing with missing values in record linkage are still lacking. This article compares the ad-hoc treatment of unknown comparison values as 'unequal' with other, more sophisticated approaches. The methods were evaluated empirically on real-world data and on simulated data derived from them.

MATERIAL AND METHODS

Cancer registry data and artificial data with an increased number of missing values in a relevant variable were used for the empirical comparisons. Classification and regression trees served as the classification method. On the resulting binary comparison patterns, the following strategies for dealing with missingness were considered: imputation with unique values, sample-based imputation, reduced-model classification and complete-case induction. These approaches were evaluated according to the amount of training data needed for induction and the F-scores achieved.
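
To make two of the strategies named above concrete, the following minimal Python sketch builds a comparison pattern for a record pair and then applies unique value imputation and complete-case filtering. It is an illustration only, not the authors' implementation: the record fields, the example values and the function names (`compare`, `comparison_pattern`, `impute_unique`, `complete_cases`) are hypothetical, and the classification tree step is not shown.

```python
from typing import List, Optional, Sequence, Union

Value = Union[int, float]


def compare(a: object, b: object) -> Optional[int]:
    """Return 1 if equal, 0 if unequal, None if either value is missing."""
    if a is None or b is None:
        return None
    return 1 if a == b else 0


def comparison_pattern(rec_a: Sequence[object],
                       rec_b: Sequence[object]) -> List[Optional[int]]:
    """Compare two records field by field."""
    return [compare(x, y) for x, y in zip(rec_a, rec_b)]


def impute_unique(pattern: Sequence[Optional[int]],
                  fill: Value) -> List[Value]:
    """Unique value imputation: replace every missing comparison by `fill`."""
    return [fill if v is None else v for v in pattern]


def complete_cases(patterns: Sequence[Sequence[Optional[int]]]) -> List[List[int]]:
    """Complete-case filtering: keep only patterns without missing comparisons."""
    return [list(p) for p in patterns if all(v is not None for v in p)]


# Hypothetical records: (surname, first name, year of birth, postal code).
rec_a = ("Meier", "Anna", 1950, None)
rec_b = ("Meier", "Anna", 1951, "55131")

pattern = comparison_pattern(rec_a, rec_b)  # [1, 1, 0, None]
as_unequal = impute_unique(pattern, 0)      # NA treated as 'unequal'
as_half = impute_unique(pattern, 0.5)       # NA coded as a third value
```

Filling with 0 keeps every pattern binary, whereas filling with 0.5 introduces a third value that a classifier can treat separately from both equality and inequality.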

RESULTS

The evaluations show that unique value imputation leads to the best results. Imputation with zero is preferred to imputation with 0.5, although the latter shows the highest median F-scores: imputation with zero needs considerably less training data, gives only slightly worse results, and simplifies the computation by maintaining the binary structure of the data.
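
For readers unfamiliar with the evaluation measure, here is a small sketch of an F-score computed from classification counts. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` and the counts in the example are hypothetical.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score from true positives, false positives, false negatives.

    Assumption: F = 2 * precision * recall / (precision + recall).
    Returns 0.0 when no true positives were found.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 duplicate pairs found correctly,
# 2 false alarms, 2 duplicate pairs missed.
example = f_score(tp=8, fp=2, fn=2)
```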

CONCLUSIONS

The results support the ad-hoc solution for missing values, 'replace NA by the value of inequality'. This conclusion is based on a limited amount of data and on a specific deduplication method. Nevertheless, the authors are confident that other empirical analyses and applications will confirm their results.
