Department of Information Engineering (DII), Università Politecnica delle Marche, Ancona, Italy.
Grenoble Informatics Laboratory, Université Grenoble Alpes, Saint-Martin-d'Hères, France.
Comput Biol Med. 2023 Sep;163:107188. doi: 10.1016/j.compbiomed.2023.107188. Epub 2023 Jun 22.
Missing data is a significant problem for the Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets contain many missing values, which results in a high level of spatiotemporal sparsity in the predictor matrix. Several state-of-the-art approaches have tried to deal with this problem by proposing data imputation strategies that (i) are often unrelated to the ML model, (ii) are not designed for EHR data, where laboratory exams are not prescribed uniformly over time and the percentage of missing values is high, and (iii) exploit only univariate and linear information from the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN-based data imputation approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputation strategy on the observable values and on the fully annotated ones. We demonstrated that the ccGAN achieves statistically significant improvements over other state-of-the-art approaches in terms of imputation (around 19.79% gain over the best competitor) and predictive performance (up to 1.60% gain over the best competitor) on a real multi-center diabetes dataset. We also demonstrated its robustness across different missingness rates (up to 1.61% gain over the best competitor in the highest missingness rate condition) on an additional benchmark EHR dataset.
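The idea of conditioning imputation on the observed values can be illustrated with a minimal Python sketch. This is not the ccGAN architecture, which the abstract does not specify; it only shows the generic mask-based step used by many GAN imputers, in which observed entries are kept and only missing entries are filled from a generator that sees the observed values and the missingness mask. All function names here are hypothetical, and `placeholder_generator` (a per-feature mean) is an assumed stand-in for a trained conditional generator.

```python
import numpy as np


def missingness_rate(x):
    """Fraction of missing (NaN) entries in the predictor matrix."""
    return float(np.isnan(x).mean())


def build_generator_input(x):
    """Return (x_filled, mask): NaNs replaced by 0, and a mask with 1 = observed."""
    mask = (~np.isnan(x)).astype(float)
    x_filled = np.where(mask == 1.0, x, 0.0)
    return x_filled, mask


def placeholder_generator(x_filled, mask):
    """Assumed stand-in for a trained conditional generator.

    It receives the observed values and the mask, and returns a full matrix
    of candidate values (here simply the per-feature mean of observed entries).
    """
    counts = mask.sum(axis=0)
    sums = (x_filled * mask).sum(axis=0)
    col_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.tile(col_means, (x_filled.shape[0], 1))


def impute(x, generator=placeholder_generator):
    """Keep observed entries; fill only the missing ones from the generator output."""
    x_filled, mask = build_generator_input(x)
    generated = generator(x_filled, mask)
    return mask * x_filled + (1.0 - mask) * generated


if __name__ == "__main__":
    # Toy predictor matrix: rows are patients, columns are lab exams (hypothetical data).
    x = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])
    print("missingness rate:", missingness_rate(x))
    print(impute(x))
```

Swapping `placeholder_generator` for a learned model with the same `(x_filled, mask)` signature would leave the combination step unchanged, which is why the mask is passed alongside the data.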