He Mengyu, Zhao Ni, Satten Glen A
bioRxiv. 2024 Mar 27:2023.03.23.533996. doi: 10.1101/2023.03.23.533996.
Advances in sequencing technology have led to the discovery of associations between the human microbiota and many diseases, conditions, and traits. With the increasing availability of microbiome data, many statistical methods have been developed for studying these associations. The growing number of newly developed methods highlights the need for simple, rapid, and reliable methods to simulate realistic microbiome data, which is essential for validating and evaluating the performance of these methods. However, generating realistic microbiome data is challenging due to the complex nature of microbiome data, which feature correlation between taxa, sparsity, overdispersion, and compositionality. Current methods for simulating microbiome data are deficient in their ability to capture these important features of microbiome data, or can require exorbitant computational time.
We develop MIDASim (MIcrobiome DAta Simulator), a fast and simple approach for simulating realistic microbiome data that reproduces the distributional and correlation structure of a template microbiome dataset. MIDASim is a two-step approach. The first step generates correlated binary indicators that represent the presence-absence status of all taxa, and the second step generates relative abundances and counts for the taxa that are considered to be present in step 1, utilizing a Gaussian copula to account for the taxon-taxon correlations. In the second step, MIDASim can operate in either a nonparametric or a parametric mode. In the nonparametric mode, the Gaussian copula uses the empirical distribution of relative abundances for the marginal distributions. In the parametric mode, an inverse generalized gamma distribution is used in place of the empirical distribution.
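The two-step idea described above can be illustrated with a short Python sketch. This is not the MIDASim implementation; it is a minimal illustration under several simplifying assumptions: the function name `simulate_two_step` and the helper `_nearest_valid_corr` are hypothetical, Pearson correlations of the template are used directly as latent Gaussian correlations, every simulated sample gets the same library size `depth`, and only the nonparametric (empirical-margin) mode is shown.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its cdf and inverse cdf


def _nearest_valid_corr(corr, eps=1e-6):
    """Hypothetical helper: repair a correlation estimate so it is positive definite."""
    corr = np.nan_to_num(corr, nan=0.0)          # constant columns give NaN correlations
    corr = (corr + corr.T) / 2.0
    np.fill_diagonal(corr, 1.0)
    vals, vecs = np.linalg.eigh(corr)
    fixed = (vecs * np.clip(vals, eps, None)) @ vecs.T  # V diag(max(lambda, eps)) V^T
    d = np.sqrt(np.diag(fixed))
    return fixed / np.outer(d, d)                 # rescale back to unit diagonal


def simulate_two_step(template_counts, n_samples, depth, rng):
    """Simulate a samples-by-taxa count table from a template (illustrative only)."""
    counts = np.asarray(template_counts, dtype=float)
    rel = counts / counts.sum(axis=1, keepdims=True)  # assumes no all-zero template sample
    present = counts > 0
    n_taxa = counts.shape[1]

    with np.errstate(invalid="ignore", divide="ignore"):
        corr_pa = np.corrcoef(present.astype(float), rowvar=False)
        corr_ra = np.corrcoef(rel, rowvar=False)

    # Step 1: correlated presence-absence indicators, obtained by thresholding
    # latent normals so that each taxon keeps its template prevalence.
    prevalence = present.mean(axis=0).clip(1e-6, 1 - 1e-6)
    thresholds = np.array([_STD_NORMAL.inv_cdf(p) for p in prevalence])
    z1 = rng.multivariate_normal(np.zeros(n_taxa), _nearest_valid_corr(corr_pa),
                                 size=n_samples)
    sim_present = z1 < thresholds

    # Step 2: Gaussian copula for relative abundances; each margin is the
    # empirical distribution of the taxon's nonzero template relative abundances.
    z2 = rng.multivariate_normal(np.zeros(n_taxa), _nearest_valid_corr(corr_ra),
                                 size=n_samples)
    u = np.vectorize(_STD_NORMAL.cdf)(z2)         # latent normals -> uniforms
    sim_rel = np.zeros((n_samples, n_taxa))
    for j in range(n_taxa):
        nonzero = rel[present[:, j], j]
        if nonzero.size:                          # taxa never observed stay at zero
            sim_rel[:, j] = np.quantile(nonzero, u[:, j])
    sim_rel *= sim_present                        # zero out taxa absent in step 1

    # Renormalize each sample and draw counts at the assumed library size.
    sim_counts = np.zeros((n_samples, n_taxa), dtype=np.int64)
    totals = sim_rel.sum(axis=1)
    for i in range(n_samples):
        if totals[i] > 0:
            sim_counts[i] = rng.multinomial(depth, sim_rel[i] / totals[i])
    return sim_counts
```

Calling `simulate_two_step(template, n_samples=100, depth=10000, rng=np.random.default_rng(0))` on a samples-by-taxa count matrix returns a simulated count matrix with the same taxa. A parametric variant would replace the `np.quantile` lookup with the quantile function of a fitted distribution.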
We demonstrate improved performance of MIDASim relative to other existing methods using gut and vaginal data. MIDASim showed superior performance by PERMANOVA and in terms of alpha diversity and beta dispersion in either parametric or nonparametric mode. We also show how MIDASim in parametric mode can be used to assess the performance of methods for finding differentially abundant taxa in a compositional model.
MIDASim is easy to implement, flexible, and suitable for most microbiome data simulation situations. MIDASim has three major advantages. First, MIDASim reproduces the distributional features of real data better than other methods at both the presence-absence level and the relative-abundance level. MIDASim-simulated data are more similar to the template data than data simulated by competing methods, as quantified using a variety of measures. Second, MIDASim makes few distributional assumptions for the relative abundances, and thus can easily accommodate complex distributional features in real data. Third, MIDASim is computationally efficient and can be used to simulate large microbiome datasets.