Ricardo Cardoso Pereira, Pedro Henriques Abreu, Pedro Pereira Rodrigues
IEEE J Biomed Health Inform. 2022 Aug;26(8):4218-4227. doi: 10.1109/JBHI.2022.3172656. Epub 2022 Aug 11.
Missing data can have severe consequences in critical contexts, such as clinical research based on routinely collected healthcare data. The issue is usually handled with imputation strategies, but these tend to produce poor and biased results under the Missing Not At Random (MNAR) mechanism. A recent trend showing promising results for MNAR is the use of generative models, particularly Variational Autoencoders. These models have a limitation, however: each imputed value comes from a single sample, which can be biased. To address this, this work introduces an extension to the Variational Autoencoder that uses a partial multiple imputation procedure. The proposed method was compared with 8 state-of-the-art imputation strategies in an experimental setup with 34 datasets from the medical context, into which missing values were injected under the MNAR mechanism at rates from 10% to 80%. Results were evaluated with the Mean Absolute Error: the new method was the overall best in 71% of the datasets and significantly outperformed the remaining methods, particularly at high missing rates. Finally, a case study on a classification task with heart failure data was conducted, in which the method led to improvements in 50% of the classifiers.
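The evaluation protocol described in the abstract (inject MNAR missingness at a given rate, impute, then score with the Mean Absolute Error) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how MNAR was injected, so this sketch hides the largest values of each feature (a simple self-masking scheme), and it uses plain mean imputation as a stand-in baseline rather than the proposed Variational Autoencoder extension. The function names `inject_mnar`, `mean_impute`, and `masked_mae` are hypothetical.

```python
import numpy as np


def inject_mnar(X, rate):
    """Hide the largest `rate` fraction of values in each column.

    Assumed MNAR scheme: whether a value is missing depends on the
    (unobserved) value itself. This is not necessarily the paper's procedure.
    """
    X_miss = X.astype(float).copy()
    mask = np.zeros(X.shape, dtype=bool)
    n_rows = X.shape[0]
    n_hide = int(round(rate * n_rows))
    for j in range(X.shape[1]):
        # Row indices of the n_hide largest values in column j.
        idx = np.argsort(X[:, j])[n_rows - n_hide:]
        mask[idx, j] = True
    X_miss[mask] = np.nan
    return X_miss, mask


def mean_impute(X_miss):
    """Baseline imputer: fill each missing entry with its column's observed mean."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)


def masked_mae(X_true, X_imputed, mask):
    """Mean Absolute Error computed only over the entries that were hidden."""
    return float(np.mean(np.abs(X_true[mask] - X_imputed[mask])))


# Synthetic stand-in data; the paper used 34 medical datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

X_miss, mask = inject_mnar(X, rate=0.2)
X_hat = mean_impute(X_miss)
mae = masked_mae(X, X_hat, mask)
print(f"MAE on hidden entries: {mae:.3f}")
```

Under this assumed scheme every hidden value is larger than the observed column mean, so mean imputation systematically underestimates the hidden entries, which is the kind of bias the abstract attributes to standard strategies under MNAR.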