University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA.
Southern Methodist University, Department of Statistical Science, 3225 Daniel Ave, Dallas, TX 75275, USA.
Gigascience. 2021 Feb 5;10(2). doi: 10.1093/gigascience/giab005.
Trillions of microbes inhabit the human body and have a profound effect on human health. The recent development of metagenome-wide association studies and other quantitative analysis methods has accelerated the discovery of associations between the human microbiome and diseases. To assess the strengths and limitations of these analytical tools, simulating realistic microbiome datasets is critically important. However, simulating realistic microbiome data is challenging because the correlation structure of such data is difficult to capture with explicit statistical models.
To address this challenge, we designed a novel simulation framework, termed MB-GAN, that uses a generative adversarial network (GAN) and draws on methodological advances from the deep learning community. MB-GAN automatically learns from given microbial abundances and generates simulated abundances that are indistinguishable from them. In practice, MB-GAN showed two advantages. First, it avoids explicit statistical modeling assumptions and requires only real datasets as inputs. Second, unlike traditional GANs, it is easy to apply and converges efficiently.
By applying MB-GAN to a case-control gut microbiome study of 396 samples, we demonstrated that the simulated data and the original data had similar first-order and second-order properties, including sparsity, diversity, and taxa-taxa correlations. These advantages make MB-GAN suitable for further microbiome methodology development where high-fidelity microbiome data are needed.
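The properties compared above (sparsity, diversity, and taxa-taxa correlations) can be illustrated with a minimal sketch. This is not the MB-GAN evaluation code. The choice of the Shannon index and Pearson correlation, the function names, and the toy abundance tables are all assumptions made for illustration; the abstract does not specify which diversity or correlation measures were used.

```python
import math

def sparsity(samples):
    """Fraction of zero entries in a samples-by-taxa abundance table."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for x in row if x == 0)
    return zeros / total

def shannon(row):
    """Shannon diversity index of one sample (natural log); an assumed measure."""
    s = sum(row)
    return -sum((x / s) * math.log(x / s) for x in row if x > 0)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def taxa_correlations(samples):
    """Pearson correlation for every pair of taxa (columns of the table)."""
    cols = list(zip(*samples))
    return {
        (i, j): pearson(cols[i], cols[j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    }

# Hypothetical toy tables (rows = samples, columns = taxa); not data from the study.
real = [[0.6, 0.4, 0.0], [0.2, 0.3, 0.5], [0.0, 0.7, 0.3], [0.5, 0.0, 0.5]]
simulated = [[0.5, 0.5, 0.0], [0.3, 0.2, 0.5], [0.0, 0.6, 0.4], [0.4, 0.0, 0.6]]

for name, table in (("real", real), ("simulated", simulated)):
    mean_shannon = sum(shannon(row) for row in table) / len(table)
    print(name, "sparsity:", round(sparsity(table), 3))
    print(name, "mean Shannon diversity:", round(mean_shannon, 3))
    print(name, "taxa-taxa correlations:", taxa_correlations(table))
```

Sparsity and per-sample diversity are first-order summaries, while the pairwise taxa correlations are a second-order summary. A simulator could be judged by how closely these summaries match between the real and simulated tables.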