Program in Bioinformatics and Genomics, Pennsylvania State University, State College, Pennsylvania, United States of America.
College of Information Sciences and Technology, Pennsylvania State University, State College, Pennsylvania, United States of America.
PLoS Comput Biol. 2023 Nov 20;19(11):e1011659. doi: 10.1371/journal.pcbi.1011659. eCollection 2023 Nov.
By applying Differential Set Analysis (DSA) to sequence count data, researchers can determine whether groups of microbes or genes are differentially enriched. Yet sequence count data suffer from a scale limitation: they carry no information about the scale (i.e., size) of the biological system under study, which has led some authors to call these data compositional (i.e., proportional). In this article, we show that commonly used DSA methods that rely on normalization make strong, implicit assumptions about the unmeasured system scale. We show that even small errors in these scale assumptions can lead to positive predictive values as low as 9%. To address this problem, we take three novel approaches. First, we introduce a sensitivity analysis framework to identify when modeling results are robust to such errors and when they are suspect. Unlike standard benchmarking studies, this framework does not require ground-truth knowledge and can therefore be applied to both simulated and real data. Second, we introduce a statistical test that provably controls Type-I error at a nominal rate despite errors in scale assumptions. Finally, we discuss how the impact of scale limitations depends on a researcher's scientific goals and provide tools that researchers can use to evaluate whether their goals are at risk from erroneous scale assumptions. Overall, the goal of this article is to catalyze future research into the impact of scale limitations in analyses of sequence count data; to illustrate that scale limitations can lead to inferential errors in practice; and to show that rigorous and reproducible scale-reliant inference is nonetheless possible if done carefully.
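The scale limitation described above can be illustrated with a toy calculation. The sketch below is not the authors' framework or test; it is a simplified, per-taxon illustration under stated assumptions. It relies on the identity that a log fold change in absolute abundance equals the log fold change in proportions plus the log ratio of system scales between conditions. Normalizing to proportions implicitly fixes that scale ratio at zero; the sketch instead sweeps a hypothetical grid of assumed scale ratios and flags which conclusions keep the same sign. All names (`log2_fold_changes`, `sign_robust`, `scale_grid`) and all numbers are invented for illustration.

```python
import math

def log2_fold_changes(props_a, props_b, log2_scale_ratio=0.0):
    """Log2 fold change in absolute abundance (condition B vs. A).

    Equals the log2 fold change in proportions plus an ASSUMED log2 ratio
    of total system scale (B vs. A). A ratio of 0.0 corresponds to the
    implicit assumption that both conditions have the same scale.
    """
    return [math.log2(pb / pa) + log2_scale_ratio
            for pa, pb in zip(props_a, props_b)]

def sign_robust(props_a, props_b, scale_grid):
    """Return, per taxon, whether the sign of the fold change is the same
    for every assumed log2 scale ratio in scale_grid."""
    results = []
    for pa, pb in zip(props_a, props_b):
        signs = {math.copysign(1.0, math.log2(pb / pa) + s) for s in scale_grid}
        results.append(len(signs) == 1)
    return results

# Toy proportions for three taxa in two conditions (hypothetical data).
props_a = [0.50, 0.30, 0.20]
props_b = [0.25, 0.35, 0.40]

# Hypothetical range of plausible log2 scale ratios between conditions.
scale_grid = [-0.5, -0.25, 0.0, 0.25, 0.5]

robust = sign_robust(props_a, props_b, scale_grid)
for i, ok in enumerate(robust):
    label = "robust" if ok else "sensitive to the scale assumption"
    print(f"taxon {i}: {label}")
```

In this toy setting, the taxon whose proportion changes only slightly is the one whose direction of change flips within the assumed range of scale ratios, which is the kind of result a sensitivity analysis is meant to flag as suspect.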