Benchmarking Differential Abundance Tests for 16S microbiome sequencing data using simulated data based on experimental templates.

Authors

Kohnert Eva, Kreutz Clemens

Affiliation

Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany.

Publication information

PLoS One. 2025 May 19;20(5):e0321452. doi: 10.1371/journal.pone.0321452. eCollection 2025.

DOI:10.1371/journal.pone.0321452

PMID:40388544

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12088514/

Abstract

Differential abundance (DA) analysis of metagenomic microbiome data is essential for understanding microbial community dynamics across various environments and hosts. Identifying microorganisms that differ significantly in abundance between conditions (e.g., health vs. disease) is crucial for insights into environmental adaptations, disease development, and host health. However, the statistical interpretation of microbiome data is challenged by its inherent sparsity and compositional nature, necessitating tailored DA methods. This benchmarking study aims to simulate synthetic 16S microbiome data using metaSPARSim (Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. MetaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20:416. https://doi.org/10.1186/s12859-019-2882-6 PMID: 31757204), MIDASim (He M, Zhao N, Satten GA. MIDASim: a fast and simple simulator for realistic microbiome data. Available from: https://doi.org/10.1101/2023.03.23.533996), and sparseDOSSA2 (Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, et al. A statistical model for describing and simulating microbial community profiles. PLOS Comput Biol. 2021;17(9):e1008913. https://doi.org/10.1371/journal.pcbi.1008913 PMID: 34516542), leveraging 38 real-world experimental templates (S3 Table) previously utilized in a benchmark study comparing DA tools. These datasets, drawn from diverse environments such as human gut, soil, and marine habitats, serve as the foundation for our simulation efforts. We apply the 14 DA tests that were used with the same experimental data in the earlier benchmark studies, alongside 8 DA tests that were developed subsequently. Initially, we will generate synthetic data closely mirroring the experimental datasets, incorporating a known truth to cover a broad range of real-world data characteristics. This approach allows us to assess the ability of DA methods to recover known true differential abundances. We will further simulate datasets by altering sparsity, effect size, and sample size, thus creating a comprehensive collection to which the 22 DA tests are applied. The outcomes, focusing on sensitivities and specificities, will provide insights into the performance of DA tests and their dependence on sparsity, effect size, and sample size. Additionally, we will calculate data characteristics (S1 and S2 Tables) for each simulated dataset and use multiple regression to identify informative data characteristics influencing test performance. Our prior study, in which we used simulated data without incorporating a known truth, demonstrated the feasibility of using synthetic data to validate experimental findings. The current study aims to enhance our understanding by systematically evaluating the impact of known truth incorporation on DA test performance, thereby providing further information for the selection and application of DA methods in microbiome research.
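
As an illustration of the evaluation described in the abstract, the following minimal Python sketch shows how sensitivity and specificity could be computed for one DA test on one simulated dataset once the true differential-abundance status of every taxon is known. It is not the authors' pipeline: the names `Confusion` and `evaluate_da_calls`, the 0.05 threshold, and the toy p-values are all assumptions made for this example, and the p-values are taken as already adjusted for multiple testing.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    """Counts of correct and incorrect DA calls (hypothetical helper type)."""
    tp: int  # truly DA taxa that were called significant
    fp: int  # non-DA taxa that were called significant
    tn: int  # non-DA taxa that were not called significant
    fn: int  # truly DA taxa that were missed

    @property
    def sensitivity(self) -> float:
        positives = self.tp + self.fn
        return self.tp / positives if positives else float("nan")

    @property
    def specificity(self) -> float:
        negatives = self.tn + self.fp
        return self.tn / negatives if negatives else float("nan")


def evaluate_da_calls(truth: Sequence[bool],
                      p_values: Sequence[float],
                      alpha: float = 0.05) -> Confusion:
    """Compare the calls of one DA test with the simulated ground truth.

    truth[i] is True if taxon i was simulated as differentially abundant.
    p_values[i] is the (assumed already adjusted) p-value for taxon i.
    A NaN p-value never satisfies `p < alpha`, so it counts as "not called".
    """
    if len(truth) != len(p_values):
        raise ValueError("truth and p_values must have the same length")
    tp = fp = tn = fn = 0
    for is_da, p in zip(truth, p_values):
        called = p < alpha
        if is_da and called:
            tp += 1
        elif is_da:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


# Toy data: 3 taxa simulated as DA, 5 simulated as unchanged (made-up values).
truth = [True, True, True, False, False, False, False, False]
p_values = [0.001, 0.03, 0.40, 0.20, 0.04, 0.90, 0.65, 0.51]

result = evaluate_da_calls(truth, p_values)
print(f"sensitivity={result.sensitivity:.3f}, specificity={result.specificity:.3f}")
```

Repeating this calculation for each DA test and each simulated dataset would give the per-dataset sensitivities and specificities that can then be related to sparsity, effect size, sample size, and other data characteristics.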
