Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Department of Statistics, University of Missouri, Columbia, MO, USA.
BMC Bioinformatics. 2021 Mar 17;22(1):126. doi: 10.1186/s12859-021-04061-3.
Identifying features is a critical task in microbiome studies, and it is complicated by the high dimensionality and heterogeneity of microbial data. This complexity makes it difficult to separate signal (features that differ between groups) from noise (features that do not differ between groups). For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, because the probability of a type I error (false positive) increases dramatically with the number of hypotheses. Moreover, the grouping effect of interest can be obscured by heterogeneity. These factors can lead to the incorrect conclusion that the microbiome compositions do not differ.
We recast the problem of identifying features that differ in a two-group comparison (e.g., treatment versus control) as a dynamic process of separating the signal from its random background. More specifically, we progressively permute the grouping factor labels of the microbiome samples and perform multiple differential abundance tests in each scenario. We then compare the signal strength of the most differential features in the original data with their performance under permutation; if these features are true positives, a visually apparent decreasing trend is observed. Simulations and applications to real data show that the proposed method produces a U-curve when the number of significant features is plotted against the proportion of mixing. The shape of the U-curve conveys the strength of the overall association between the microbiome and the grouping factor. We also define a fragility index to measure the robustness of the discoveries. Finally, we recommend the identified features by comparing p-values in the observed data with p-values in the fully mixed data.
We have implemented the method as a user-friendly and efficient R Shiny tool with visualizations. By default, we use the Wilcoxon rank sum test to compute the p-values, since it is a robust nonparametric test. The proposed method can also use p-values obtained from other testing methods, such as DESeq. This flexibility suggests that the progressive permutation method can be extended to new settings.
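The progressive permutation idea described above can be illustrated with a small sketch. This is not the authors' R Shiny implementation: the function names (`rank_sum_pvalue`, `progressive_permutation`), the simulated data, the balanced group sizes, the unadjusted 0.05 threshold, and the use of a normal approximation for the rank sum test are all assumptions made for illustration. At each step, the labels of k samples from each group are exchanged, so a mixing proportion of 0 is the observed labeling, 0.5 is fully mixed, and 1 is a complete exchange of the two groups.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value using the normal approximation.

    Ties receive midranks and the variance is tie-corrected. No continuity
    correction is applied (an illustrative simplification).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0  # all values tied: no evidence of a difference
    z = (u - n1 * n2 / 2.0) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


def progressive_permutation(features, labels, n_steps=10, n_repeats=20,
                            alpha=0.05, pvalue_fn=rank_sum_pvalue, seed=0):
    """Return (mixing proportion, mean number of significant features) pairs.

    Assumes two groups labeled 0 and 1; group sizes are assumed equal so
    that a proportion of 1 is a complete exchange of the group labels.
    """
    rng = random.Random(seed)
    idx0 = [i for i, g in enumerate(labels) if g == 0]
    idx1 = [i for i, g in enumerate(labels) if g == 1]
    max_swaps = min(len(idx0), len(idx1))
    curve = []
    for step in range(n_steps + 1):
        k = round(step * max_swaps / n_steps)  # samples swapped per group
        counts = []
        for _ in range(n_repeats):
            perm = list(labels)
            for i in rng.sample(idx0, k):
                perm[i] = 1
            for i in rng.sample(idx1, k):
                perm[i] = 0
            n_sig = 0
            for row in features:
                x = [v for v, g in zip(row, perm) if g == 0]
                y = [v for v, g in zip(row, perm) if g == 1]
                if pvalue_fn(x, y) < alpha:
                    n_sig += 1
            counts.append(n_sig)
        curve.append((k / max_swaps, sum(counts) / len(counts)))
    return curve


# Simulated example (hypothetical data): 40 features, 20 samples per group,
# with the first 10 features shifted upward in group 1.
_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
features = [
    [_rng.gauss(0.0, 1.0) + (2.0 if f < 10 and g == 1 else 0.0) for g in labels]
    for f in range(40)
]
curve = progressive_permutation(features, labels)
for proportion, mean_sig in curve:
    print(f"mixing {proportion:.1f}: {mean_sig:.2f} significant features")
```

Because the two-sided test is symmetric in the two groups, the count at a mixing proportion of 1 matches the count at 0, which is what gives the curve its U shape when a real signal is present. The test is passed in through the hypothetical `pvalue_fn` argument, so a different two-group test could be substituted in this sketch without changing the permutation loop.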