Liu Chi, Mansoldo Felipe R P, Li Hankang, Vermelho Alane Beatriz, Zeng Raymond Jianxiong, Li Xiangzhen, Yao Minjie
Engineering Research Center of Soil Remediation of Fujian Province University, College of Resources and Environment, Fujian Agriculture and Forestry University, Fuzhou, China.
Bioinovar Laboratory, General Microbiology Department, Institute of Microbiology Paulo de Goes, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.
Nat Protoc. 2025 Aug 6. doi: 10.1038/s41596-025-01239-4.
The increasing complexity of experimental designs, the growing volume of data and the diversification of omics data types in the microbiome field pose substantial challenges to statistical analysis and visualization. Here we present a step-by-step protocol, based on the R package microeco ( https://github.com/ChiLiubio/microeco ), for the statistical analysis and visualization of microbiome data. The omics data types covered are amplicon sequencing data, metagenomic sequencing data and nontargeted metabolomics data. The analysis of amplicon sequencing data covers data preprocessing and normalization, core taxa, alpha diversity, beta diversity, differential abundance testing and machine learning. We consider multiple data analysis scenarios in each section to demonstrate the breadth of the protocol. We emphasize that the normalization method, and hence the normalized dataset, is chosen separately for each downstream analysis according to best analytical practices. Additionally, in the differential abundance testing section, we use parametric community simulation to evaluate the performance of the various testing approaches. For metagenomic data, the focus is on how the outputs of bioinformatic analysis are read and preprocessed, which is where usage differs most from amplicon sequencing data. For metabolomics data, we mainly demonstrate differential testing, machine learning and association analysis with microbial abundances. To address some complex analyses, the protocol combines different types of methods into analysis pipelines. Compared with alternative methods, this protocol is more comprehensive and scalable. The provided R code runs in about 6 h on a laptop computer.
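As a rough illustration of three of the steps named above (normalization, alpha diversity and beta diversity), the following minimal sketch computes the Shannon index and the Bray-Curtis dissimilarity on a toy count table. It is not the microeco API and is not taken from the protocol: the function names, the choice of total-sum scaling and the sample counts are all assumptions made for illustration only.

```python
import math


def relative_abundance(counts):
    """Total-sum scaling: convert raw counts to proportions (assumed normalization)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def shannon(counts):
    """Shannon index (natural log) of one sample, an alpha diversity measure."""
    props = relative_abundance(counts)
    # Taxa with zero abundance contribute nothing to the sum.
    return -sum(p * math.log(p) for p in props if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples, a beta diversity measure.

    Computed on relative abundances, so the denominator is always 2.
    """
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    numerator = sum(abs(x - y) for x, y in zip(pa, pb))
    return numerator / (sum(pa) + sum(pb))


# Hypothetical toy count table: rows are samples, columns are taxa.
samples = {
    "S1": [10, 10, 10, 10],  # perfectly even community
    "S2": [40, 0, 0, 0],     # single dominant taxon
}

for name, counts in samples.items():
    print(name, "Shannon:", round(shannon(counts), 3))

print("Bray-Curtis S1 vs S2:", bray_curtis(samples["S1"], samples["S2"]))
```

The even community gives the maximum Shannon value for four taxa, while the single-taxon community gives zero; which normalization is appropriate for each analysis is one of the choices the protocol itself addresses.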