Department of Mathematical Modelling, Statistics and Bioinformatics at Ghent University, Belgium.
Center for Statistics at Hasselt University, Belgium.
Brief Bioinform. 2019 Jan 18;20(1):210-221. doi: 10.1093/bib/bbx104.
High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim to detect associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important for identifying the microbiome as a prognostic or diagnostic biomarker or for demonstrating the efficacy of prodrug or antibiotic drugs. Because benchmarking studies are lacking in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real-data shuffling algorithms. The results are consistent across the different approaches, and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils the reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that generates correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods.
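The abstract mentions generating correlated count data with arbitrary univariate count margins but does not say how the tool does this. As an illustration only, the sketch below uses a Gaussian copula, which is one common way to obtain that property and is an assumption here, not a description of the authors' tool. The function names (`cholesky`, `poisson_pmf`, `quantile`, `simulate_counts`), the Poisson margins, and the 0.7 correlation are all hypothetical choices made for the example.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def poisson_pmf(mean):
    """Return a Poisson probability mass function (hypothetical margin)."""
    def pmf(k):
        return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    return pmf


def quantile(pmf, u, max_count=10_000):
    """Smallest count k whose cumulative probability reaches u.

    Works for any count pmf; max_count is an arbitrary safety cap.
    """
    cumulative = 0.0
    for k in range(max_count + 1):
        cumulative += pmf(k)
        if u <= cumulative:
            return k
    return max_count


def simulate_counts(corr, pmfs, n_samples, seed=0):
    """Draw correlated counts: one row per sample, one column per species."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    std_normal = NormalDist()
    n_species = len(pmfs)
    samples = []
    for _ in range(n_samples):
        # Independent standard normals, then correlate them via the factor.
        z = [rng.gauss(0.0, 1.0) for _ in range(n_species)]
        correlated = [
            sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(n_species)
        ]
        # Map to uniforms, then to counts through each margin's quantiles.
        uniforms = [std_normal.cdf(x) for x in correlated]
        samples.append([quantile(pmf, u) for pmf, u in zip(pmfs, uniforms)])
    return samples


# Two hypothetical species with latent (normal-scale) correlation 0.7.
corr = [[1.0, 0.7], [0.7, 1.0]]
pmfs = [poisson_pmf(5.0), poisson_pmf(20.0)]
counts = simulate_counts(corr, pmfs, n_samples=2000)
```

In this sketch the correlation matrix applies to the latent normal variables, so the correlation observed between the counts is generally somewhat weaker than the value supplied. A matrix estimated from real data could be passed in place of `corr`, and any other count pmf could replace the Poisson margins.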