Missing values in deduplication of electronic patient data.

Affiliation

Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University, Mainz, Germany.

Publication information

J Am Med Inform Assoc. 2012 Jun;19(e1):e76-82. doi: 10.1136/amiajnl-2011-000461. Epub 2011 Oct 15.

DOI:10.1136/amiajnl-2011-000461

PMID:22003173

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3392851/

Abstract

INTRODUCTION

Systematic approaches to dealing with missing values in record linkage are still lacking. This article compares the ad-hoc treatment of unknown comparison values as 'unequal' with other, more sophisticated approaches. The methods were evaluated empirically on real-world data and on simulated data derived from them.

MATERIAL AND METHODS

Cancer registry data and artificial data with increased numbers of missing values in a relevant variable were used for the empirical comparisons. Classification and regression trees served as the classification method. On the resulting binary comparison patterns, the following strategies for dealing with missingness were considered: imputation with unique values, sample-based imputation, reduced-model classification and complete-case induction. These approaches were evaluated according to the amount of training data needed for induction and the F-scores achieved.
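As a minimal sketch of what imputation with unique values means on binary comparison patterns: the field names, the toy records, and the helper names `compare` and `impute_unique` below are assumptions made for illustration and are not taken from the article. A comparison yields 1 for agreement, 0 for disagreement, and `None` when either value is missing; unique value imputation then replaces every `None` with one fixed value such as 0 or 0.5.

```python
from typing import Optional, Sequence

# Hypothetical record layout: field name -> value, None marks a missing value.
Record = dict[str, Optional[str]]

# Illustrative field names, not the registry's actual variables.
FIELDS = ("first_name", "last_name", "birth_year")


def compare(a: Record, b: Record, fields: Sequence[str] = FIELDS) -> list[Optional[int]]:
    """Build a binary comparison pattern for a record pair.

    1 = values agree, 0 = values disagree, None = not comparable
    because at least one of the two values is missing.
    """
    pattern: list[Optional[int]] = []
    for field in fields:
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            pattern.append(None)
        else:
            pattern.append(1 if va == vb else 0)
    return pattern


def impute_unique(pattern: Sequence[Optional[int]], fill: float = 0) -> list[float]:
    """Replace every missing comparison value with one fixed value.

    fill=0 treats 'unknown' as 'unequal' and keeps the pattern binary;
    fill=0.5 places 'unknown' halfway between 'unequal' and 'equal'.
    """
    return [fill if value is None else value for value in pattern]


rec_a = {"first_name": "ANNA", "last_name": "MEYER", "birth_year": "1950"}
rec_b = {"first_name": "ANNA", "last_name": None, "birth_year": "1951"}

raw = compare(rec_a, rec_b)          # last_name cannot be compared
as_unequal = impute_unique(raw, 0)   # ad-hoc rule: NA counts as unequal
as_half = impute_unique(raw, 0.5)    # alternative unique value
```

With `fill=0` the imputed pattern contains only 0 and 1, so a classifier trained on it never has to handle a third value.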

RESULTS

The evaluations reveal that unique value imputation leads to the best results. Imputation with zero is preferred to imputation with 0.5, although the latter shows the highest median F-scores. Imputation with zero needs considerably less training data, shows only slightly worse results, and simplifies the computation by maintaining the binary structure of the data.
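For readers unfamiliar with the metric, the following sketch computes an F-score for a set of record-pair classifications. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the labels below are toy values, not results from the study.

```python
from typing import Sequence


def f_score(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Balanced F-score for binary labels (1 = duplicate pair, 0 = non-duplicate)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for five record pairs (illustrative only).
toy_predicted = [1, 1, 0, 0, 1]
toy_actual = [1, 0, 0, 1, 1]
toy_f = f_score(toy_predicted, toy_actual)  # 2 TP, 1 FP, 1 FN
```

Because true non-duplicates do not enter the formula, the score is not inflated by the large number of non-matching pairs that deduplication typically produces.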

CONCLUSIONS

The results support the ad-hoc solution for missing values, 'replace NA by the value of inequality'. This conclusion is based on a limited amount of data and on a specific deduplication method. Nevertheless, the authors are confident that their results will be confirmed by other empirical analyses and applications.
