Jiang Yifan, Liao Disen, Zhu Qiyun, Lu Yang Young
Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada.
School of Life Sciences, Arizona State University, Tempe, AZ, 85281, United States.
Bioinformatics. 2025 Feb 4;41(2). doi: 10.1093/bioinformatics/btaf014.
Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.
Here, we propose PhyloMix, a novel data augmentation method designed specifically for microbiome data to enhance predictive analyses. PhyloMix uses the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. It creates a new sample by removing a subtree from one sample and replacing it with the corresponding subtree from another sample. Notably, PhyloMix is designed to respect the compositional nature of microbiome data and handles both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms several baseline methods, including sample-mixing-based data augmentation techniques such as vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
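The subtree exchange described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `subtree_swap`, the representation of a subtree as a list of leaf (taxon) indices, the renormalization step, and the label weight `lam` are all assumptions made for illustration only.

```python
import numpy as np

def subtree_swap(x_a, x_b, subtree_leaves):
    """Hypothetical sketch of a phylogeny-guided sample mix.

    x_a, x_b       : abundance vectors over the same taxa (counts or proportions)
    subtree_leaves : indices of the taxa under one subtree of the phylogeny
    Returns the mixed sample and an assumed label weight for sample A.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    idx = np.asarray(subtree_leaves, dtype=int)

    total_a = x_a.sum()
    # Move both samples to relative abundances so they are comparable
    # regardless of sequencing depth.
    p_a = x_a / total_a
    p_b = x_b / x_b.sum()

    # Drop the subtree from A and put B's corresponding subtree in its place.
    mixed = p_a.copy()
    mixed[idx] = p_b[idx]

    mass = mixed.sum()
    if mass == 0:
        # Degenerate case: nothing left after the swap, so fall back to A.
        return x_a.copy(), 1.0

    # Renormalize so the result has the same total as A
    # (1.0 for relative abundances, the original depth for raw counts).
    mixed = mixed / mass * total_a

    # Assumed label weight: fraction of taxa still taken from A.
    lam = 1.0 - idx.size / x_a.size
    return mixed, lam

# Example: A is given as raw counts, B as relative abundances.
mixed, lam = subtree_swap([10, 20, 30, 40], [0.4, 0.3, 0.2, 0.1], [2, 3])
```

In this sketch the synthetic label would be formed as `lam * y_a + (1 - lam) * y_b`, as in mixup-style augmentation. How PhyloMix actually selects subtrees and weights labels is defined in the paper and the linked repository.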
The Apache-licensed source code is available at https://github.com/batmen-lab/phylomix.