Myers Tyler, Song Se Jin, Chen Yang, De Pessemier Britta, Khatib Lora, McDonald Daniel, Huang Shi, Gallo Richard, Callewaert Chris, Havulinna Aki S, Lahti Leo, Roeselers Guus, Laiola Manolo, Shetty Sudarshan A, Kelley Scott T, Knight Rob, Bartko Andrew
Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA.
Shu Chien-Gene Lay Department of Bioengineering, University of California San Diego, La Jolla, CA, USA.
Commun Biol. 2025 Aug 6;8(1):1159. doi: 10.1038/s42003-025-08590-y.
Deep learning for microbiome analysis has shown potential for understanding microbial communities and human phenotypes. Here, we propose an approach, Transformer-based Robust Principal Component Analysis (TRPCA), which leverages the strengths of transformer architectures and the interpretability of Robust Principal Component Analysis. To investigate the benefits of TRPCA over conventional machine learning models, we benchmarked performance on age prediction from three body sites (skin, oral, gut), with 16S rRNA gene amplicon (16S) and whole-genome sequencing (WGS) data. We demonstrated prediction of age from longitudinal samples and combined classification and regression tasks via multi-task learning (MTL). TRPCA improves age prediction accuracy from human microbiome samples, achieving the largest reductions in mean absolute error (MAE) for WGS skin (MAE: 8.03, 28% reduction) and 16S skin (MAE: 5.09, 14% reduction) samples, compared to conventional approaches. Additionally, TRPCA's MTL approach achieves an accuracy of 89% for birth country prediction across 5 countries, while improving age prediction from WGS stool samples. Notably, residual analysis of paired samples across sequencing methods (16S/WGS) and body sites (oral/gut) uncovers a link between subject and prediction error. These findings highlight TRPCA's utility in improving age prediction while maintaining feature-level interpretability and elucidating connections between individuals and their microbiomes.
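The reported figures pair an MAE with a percentage reduction relative to a conventional baseline. As a minimal sketch of how such numbers are typically computed, the Python below assumes the reduction is the relative decrease in MAE against the baseline model; the abstract does not state the exact formula. The function names and the toy ages are hypothetical and are not data from the study.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ages."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def percent_reduction(baseline_mae, model_mae):
    """Relative decrease in MAE versus a baseline, in percent (assumed definition)."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Toy example with made-up ages (years); not taken from the paper.
ages = [20, 35, 50, 65]
baseline_pred = [30, 45, 40, 55]  # hypothetical conventional model, off by 10 each
model_pred = [25, 40, 45, 60]     # hypothetical improved model, off by 5 each

baseline_mae = mean_absolute_error(ages, baseline_pred)
model_mae = mean_absolute_error(ages, model_pred)
reduction = percent_reduction(baseline_mae, model_mae)
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}, reduction={reduction:.0f}%")
```

Under this definition, a reduction is only meaningful when both models are evaluated on the same held-out samples.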