Clark-Boucher Dylan, Coull Brent A, Reeder Harrison T, Wang Fenglei, Sun Qi, Starr Jacqueline R, Lee Kyu Ha
Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA.
Biostatistics, Massachusetts General Hospital, Boston, MA, USA.
BMC Bioinformatics. 2025 Jul 29;26(1):196. doi: 10.1186/s12859-025-06235-9.
A key challenge in differential abundance analysis (DAA) of microbial sequencing data is that the counts for each sample are compositional, resulting in potentially biased comparisons of the absolute abundance across study groups. Normalization-based DAA methods rely on external normalization factors that account for compositionality by standardizing the counts onto a common numerical scale. However, existing normalization methods have struggled to maintain the false discovery rate in settings where the variance or compositional bias is large. This article proposes a novel framework for normalization that can reduce bias in DAA by re-conceptualizing normalization as a group-level task. We present two new normalization methods within the group-wise framework: group-wise relative log expression (G-RLE) and fold-truncated sum scaling (FTSS).
G-RLE and FTSS achieve higher statistical power for identifying differentially abundant taxa than existing methods in model-based and synthetic data simulation settings. The two novel methods also maintain the false discovery rate in challenging scenarios where existing methods fail to do so. The best results are obtained from using FTSS normalization with the DAA method metagenomeSeq.
Compared with other methods for normalizing compositional sequence count data prior to DAA, the proposed group-level normalization frameworks offer more robust statistical inference. With a solid mathematical foundation, validated performance in numerical studies, and publicly available software, these new methods can help improve rigor and reproducibility in microbiome research.
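To make the compositionality problem concrete, the following minimal sketch contrasts two conventional sample-wise normalizations, total sum scaling (TSS) and relative log expression (RLE, the median-of-ratios approach). It does not implement G-RLE or FTSS, which the abstract names but does not define. The function names `tss_factors` and `rle_factors` and the toy count table are hypothetical and chosen only for illustration.

```python
import numpy as np

def tss_factors(counts):
    """Total sum scaling: one factor per sample, equal to its library size."""
    return np.asarray(counts, dtype=float).sum(axis=0)

def rle_factors(counts):
    """Sample-wise RLE (median-of-ratios) factors for a taxa x samples table.

    The reference is the per-taxon geometric mean across samples, computed
    only over taxa with positive counts in every sample.
    """
    counts = np.asarray(counts, dtype=float)
    positive = (counts > 0).all(axis=1)
    if not positive.any():
        raise ValueError("RLE needs at least one taxon with no zero counts")
    log_counts = np.log(counts[positive])
    log_ref = log_counts.mean(axis=1, keepdims=True)  # log geometric mean
    return np.exp(np.median(log_counts - log_ref, axis=0))

# Toy table (assumed data): 4 taxa x 3 samples. Sequencing depth doubles from
# one sample to the next, and taxon 0 is additionally 10x more abundant in
# sample 3. The other three taxa have the same absolute abundance everywhere.
counts = np.array([
    [10, 20, 400],
    [20, 40,  80],
    [30, 60, 120],
    [40, 80, 160],
])

tss = tss_factors(counts)
rle = rle_factors(counts)

tss_normalized = counts / tss  # proportions within each sample
rle_normalized = counts / rle  # counts on a common scale

print("TSS factors:", tss)
print("RLE factors:", rle)
print("Taxon 1 after TSS:", tss_normalized[1])
print("Taxon 1 after RLE:", rle_normalized[1])
```

In this toy table, the surge in taxon 0 inflates the library size of sample 3, so TSS makes the unchanged taxon 1 look depleted there, while the median-based RLE factor is unaffected by the single outlying taxon. This is only an illustration of compositional bias under idealized conditions; it says nothing about how either method behaves on real microbiome data with many zeros.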