Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, United States.
College of Nursing, University of South Florida, Tampa, FL 33620, United States.
Bioinformatics. 2023 May 4;39(5). doi: 10.1093/bioinformatics/btad286.
Growing evidence links the human microbiome to various diseases, and it has a profound impact on human health. Because changes in microbiome composition over time are associated with disease and clinical outcomes, microbiome analysis should be performed longitudinally. However, limited sample sizes and differing numbers of timepoints across subjects mean that a significant amount of data cannot be used, which directly affects the quality of analysis results. Deep generative models have been proposed to address this data scarcity. In particular, generative adversarial networks (GANs) have been used successfully for data augmentation to improve prediction tasks. Recent studies have also shown that GAN-based models outperform traditional imputation methods for missing value imputation in multivariate time series datasets.
This work proposes DeepMicroGen, a GAN model based on a bidirectional recurrent neural network that is trained on the temporal relationships between observations to impute missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, achieving the lowest mean absolute error on both simulated and real datasets. Finally, imputing an incomplete longitudinal dataset with the proposed model before training a classifier improved the prediction of clinical outcomes for allergies.
DeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.
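To make the evaluation idea concrete, the following is a minimal sketch of how a baseline imputation method could be scored with mean absolute error on held-out timepoints. It is not the DeepMicroGen model or its evaluation code. The baseline (per-taxon linear interpolation over time), the function names `linear_interpolate_missing` and `masked_mae`, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def linear_interpolate_missing(series, observed):
    """Fill missing timepoints by per-feature linear interpolation over time.

    series:   (T, F) array for one subject; rows at missing timepoints may be NaN.
    observed: (T,) boolean mask, True where the timepoint was sampled.
    Hypothetical baseline for illustration only.
    """
    t = np.arange(series.shape[0])
    filled = series.copy()
    for j in range(series.shape[1]):
        # np.interp clamps to the nearest observed value outside the observed range.
        filled[~observed, j] = np.interp(t[~observed], t[observed], series[observed, j])
    return filled


def masked_mae(truth, imputed, observed):
    """Mean absolute error computed only over the held-out (missing) timepoints."""
    missing = ~observed
    return float(np.mean(np.abs(truth[missing] - imputed[missing])))


# Toy values (not real data): 5 timepoints, 2 taxa.
truth = np.array([[0.1, 0.9],
                  [0.2, 0.8],
                  [0.3, 0.7],
                  [0.4, 0.6],
                  [0.5, 0.5]])
observed = np.array([True, False, True, True, False])

# Hide the held-out timepoints, impute them, then score against the truth.
incomplete = truth.copy()
incomplete[~observed] = np.nan
imputed = linear_interpolate_missing(incomplete, observed)
mae = masked_mae(truth, imputed, observed)
print(f"MAE on held-out timepoints: {mae:.3f}")
```

Restricting the error to the masked timepoints keeps observed values, which any method reproduces exactly, from diluting the score.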