Austin George I, Korem Tal
Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, USA.
Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, New York, USA.
mSystems. 2025 May 20;10(5):e0002125. doi: 10.1128/msystems.00021-25. Epub 2025 May 2.
Gihawi et al. (mBio 14:e01607-23, 2023, https://doi.org/10.1128/mbio.01607-23) argued that the analysis of tumor-associated microbiome data by Poore et al. (Nature 579:567-574, 2020, https://doi.org/10.1038/s41586-020-2095-1) is invalid because features that were originally very sparse (genera with mostly zero read counts) became associated with the phenotype following batch correction. Here, we examine whether such an observation should necessarily indicate issues with processing or machine learning pipelines. We show counterexamples using the centered log ratio (CLR) transformation, which is often used for analysis of compositional microbiome data. The CLR transformation has similarities to voom-SNM, the batch-correction method brought into question by Gihawi et al., and yet is a sample-wise operation that cannot, in itself, "leak" information or invalidate downstream analyses. We show that because the CLR transformation divides each value by the geometric mean of its sample, common imputation strategies for missing or zero values result in transformed features that are associated with the geometric mean. Through analyses of both synthetic and vaginal microbiome data sets, we demonstrate that when the geometric mean is associated with a phenotype, sparse and CLR-transformed features will also become associated with it. We re-analyze features highlighted by Gihawi et al. and demonstrate that the phenomenon of sparse features becoming phenotype-associated can also be observed after a CLR transformation, which serves as a counterexample to the claim that such an observation necessarily means information leakage. 
While we do not intend to address other concerns regarding tumor microbiome analyses, validate Poore et al.'s results, or evaluate batch-correction pipelines, we conclude that because phenotype-associated features that were initially sparse can be created by a sample-wise transformation that cannot artifactually inflate machine learning performance, their detection is not independently sufficient to demonstrate information leakage in machine learning pipelines. Microbiome data are multivariate, and as such, a value of 0 carries a different meaning for each sample. Many transformations, including CLR and other batch-correction methods, are likewise multivariate, and, as these issues demonstrate, each individual feature should be interpreted with caution.
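The mechanism described above can be illustrated with a minimal synthetic sketch. It is not the authors' analysis. The group compositions, sequencing depth, pseudocount of 1, and all variable names are assumptions chosen for illustration. With a pseudocount c, a zero count maps to log(c) minus the log geometric mean of its sample, so a feature that is zero everywhere becomes an exact negative copy of the log geometric mean. If the geometric mean differs between phenotype groups, the transformed feature does too, although the operation uses only values from within each sample.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Sample-wise CLR: log of each value minus the mean log of its own row.

    Zeros are imputed by adding a pseudocount (an assumed, common strategy).
    """
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_per_group, n_dense, depth = 50, 20, 10_000  # hypothetical sizes

# Controls: one dominant taxon. Cases: an even community.
# Evenness raises the geometric mean, so it differs between groups.
dominated = np.array([0.9] + [0.1 / (n_dense - 1)] * (n_dense - 1))
even = np.full(n_dense, 1.0 / n_dense)

controls = rng.multinomial(depth, dominated, size=n_per_group)
cases = rng.multinomial(depth, even, size=n_per_group)
dense = np.vstack([controls, cases])

# A sparse feature that is zero in every sample: it carries no
# phenotype information before the transformation.
sparse = np.zeros((2 * n_per_group, 1), dtype=dense.dtype)
counts = np.hstack([dense, sparse])
phenotype = np.repeat([0, 1], n_per_group)

transformed = clr(counts)
log_gmean = np.log(counts + 1.0).mean(axis=1)  # log geometric mean per sample
sparse_clr = transformed[:, -1]

# The transformed sparse feature equals log(1) - log_gmean = -log_gmean.
print("max |sparse_clr + log_gmean|:", np.abs(sparse_clr + log_gmean).max())
print("mean CLR of sparse feature, controls:", sparse_clr[phenotype == 0].mean())
print("mean CLR of sparse feature, cases:   ", sparse_clr[phenotype == 1].mean())
```

Because `clr` never looks across rows or at `phenotype`, any association in `sparse_clr` is inherited from the per-sample geometric mean rather than from information shared between samples.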
Gihawi et al. claim that observing a transformation turn highly sparse (mostly zero) features into phenotype-associated features is sufficient to conclude that information leakage occurred and to invalidate an analysis. This claim has critical implications both for the debate over The Cancer Genome Atlas (TCGA) cancer microbiome analysis and for the interpretation and evaluation of analyses in the microbiome field at large. We show, by counterexample and by reanalysis, that such transformations can be valid.