Department of Mathematical Sciences, University of Montana, 32 Campus Dr., Missoula, MT, USA.
Department of Biostatistics, West Virginia University, 1 Medical Center Dr., Morgantown, WV, USA.
Biostatistics. 2019 Oct 1;20(4):615-631. doi: 10.1093/biostatistics/kxy020.
The human microbiota composition is associated with a number of diseases, including obesity, inflammatory bowel disease, and bacterial vaginosis. Thus, microbiome research has the potential to reshape clinical and therapeutic approaches. However, raw microbiome count data require careful pre-processing that takes into account both the sparsity of counts and the large number of taxa being measured. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. Despite progress in the number and quality of filtering approaches, there is no consensus on filtering standards and quality assessment. This can adversely affect downstream analyses and the reproducibility of results across platforms and software. We introduce PERFect, a novel permutation filtering approach designed to address two unsolved problems in microbiome data processing: (i) define and quantify loss due to filtering by implementing thresholds and (ii) introduce and evaluate a permutation test for filtering loss to provide a measure of excessive filtering. Methods are assessed on three "mock experiment" data sets, where the true taxa compositions are known, and are applied to two publicly available real microbiome data sets. The method correctly removes contaminant taxa in the "mock" data sets, quantifies and visualizes the corresponding filtering loss, and provides a uniform, data-driven filtering criterion for real microbiome data sets. In real data analyses, PERFect tends to remove more taxa than existing approaches; this likely happens because the method is based on an explicit loss function, uses statistically principled testing, and takes into account correlation between taxa. The PERFect software is freely available at https://github.com/katiasmirn/PERFect.
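The two ideas in the abstract, flagging rare low-count taxa and quantifying what is lost by dropping them, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the thresholds `min_samples` and `min_count`, the toy count matrix, and the choice of a Frobenius-norm loss on the taxa cross-product matrix. It is not the PERFect implementation, and it omits the permutation test; the repository linked above holds the actual method.

```python
import numpy as np

def rare_taxa_mask(counts, min_samples=2, min_count=5):
    """Flag taxa (columns) seen in few samples and only at low counts.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    counts = np.asarray(counts)
    prevalence = (counts > 0).sum(axis=0)   # samples in which each taxon appears
    max_count = counts.max(axis=0)          # largest count observed per taxon
    return (prevalence < min_samples) & (max_count < min_count)

def filtering_loss(counts, removed):
    """Share of ||X^T X||_F^2 lost when the flagged columns are dropped.

    Assumed loss for illustration: because it uses X^T X, it reflects
    co-occurrence between taxa as well as their individual magnitudes.
    """
    x = np.asarray(counts, dtype=float)
    kept = x[:, ~np.asarray(removed, dtype=bool)]
    total = np.linalg.norm(x.T @ x, "fro") ** 2
    remaining = np.linalg.norm(kept.T @ kept, "fro") ** 2
    return 1.0 - remaining / total

# Toy data: rows are samples, columns are taxa (made-up numbers).
counts = np.array([
    [10, 0, 3, 0],
    [12, 1, 0, 0],
    [ 9, 0, 4, 0],
    [11, 0, 5, 2],
])

mask = rare_taxa_mask(counts)          # taxa 1 and 3 are rare and low-count
loss = filtering_loss(counts, mask)    # small, since those taxa carry little signal
print(mask, round(loss, 4))
```

A loss near 0 means the removed taxa contributed little to the overall structure of the data, while a loss approaching 1 would signal excessive filtering under this assumed measure.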