Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University, Mainz, Germany.
J Am Med Inform Assoc. 2012 Jun;19(e1):e76-82. doi: 10.1136/amiajnl-2011-000461. Epub 2011 Oct 15.
Systematic approaches to dealing with missing values in record linkage are still lacking. This article compares the ad-hoc treatment of unknown comparison values as 'unequal' with other, more sophisticated approaches. The methods were evaluated empirically on real-world data as well as on simulated data derived from them.
Cancer registry data and artificial data with increased numbers of missing values in a relevant variable were used for empirical comparisons. Classification and regression trees served as the classification method. On the resulting binary comparison patterns, the following strategies for dealing with missingness were considered: imputation with unique values, sample-based imputation, reduced-model classification and complete-case induction. These approaches were evaluated according to the amount of training data needed for induction and the F-scores achieved.
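Two of the strategies named above can be illustrated with a minimal sketch. It assumes a comparison pattern is a list with 1 for agreement, 0 for disagreement and None for a missing comparison; the function names `impute_unique` and `complete_cases` are hypothetical and are not taken from the article.

```python
from typing import List, Optional

# Assumed encoding: 1 = attributes agree, 0 = disagree, None = comparison missing.
Pattern = List[Optional[int]]


def impute_unique(pattern: Pattern, fill: float) -> List[float]:
    """Unique value imputation: replace every missing comparison by one fixed value.

    fill=0.0 treats a missing comparison as 'unequal' and keeps the pattern binary;
    fill=0.5 marks missingness with a value between 'unequal' and 'equal'.
    """
    return [fill if value is None else float(value) for value in pattern]


def complete_cases(patterns: List[Pattern]) -> List[Pattern]:
    """Complete-case selection: keep only patterns without any missing comparison."""
    return [p for p in patterns if None not in p]


if __name__ == "__main__":
    patterns = [[1, None, 0], [1, 1, 1], [None, 0, 0]]
    print([impute_unique(p, 0.0) for p in patterns])
    print([impute_unique(p, 0.5) for p in patterns])
    print(complete_cases(patterns))
```

The imputed patterns could then be passed to any tree learner; sample-based imputation and reduced-model classification are left out of this sketch because the abstract does not specify them in enough detail.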
The evaluations reveal that unique value imputation leads to the best results. Imputation with zero is preferred to imputation with 0.5, although the latter shows the highest median F-scores: imputation with zero needs considerably less training data, shows only slightly worse results, and simplifies the computation by maintaining the binary structure of the data.
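For reference, the F-score used as the quality measure is conventionally the harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant and counts of true positives, false positives and false negatives among classified record pairs; the function name `f_score` is hypothetical, and the abstract does not state which F-score variant was used.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    tp: record pairs correctly classified as matches
    fp: non-matching pairs wrongly classified as matches
    fn: matching pairs that were missed
    Returns 0.0 when precision or recall is undefined or both are zero.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f_score(tp=8, fp=2, fn=2))
```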
The results support the ad-hoc solution for missing values: 'replace NA by the value of inequality'. This conclusion is based on a limited amount of data and on a specific deduplication method. Nevertheless, the authors are confident that other empirical analyses and applications will confirm their results.