

Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data.

Author Information

Pope Quintin, Varma Rohan, Tataru Christine, David Maude M, Fern Xiaoli

Affiliations

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America.

Department of Pathology, Brigham and Women's Hospital, Boston, Massachusetts, United States of America.

Publication Information

PLoS Comput Biol. 2025 May 7;21(5):e1011353. doi: 10.1371/journal.pcbi.1011353. eCollection 2025 May.

Abstract

We use open source human gut microbiome data to learn a microbial "language" model by adapting techniques from Natural Language Processing (NLP). Our microbial "language" model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Irritable Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.
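The self-supervised setup described above treats each microbiome sample as a "sentence" of taxa and, in the style of masked language modeling, hides individual taxa so the model must predict them from the surrounding community context. As a minimal illustration of how such training pairs can be constructed (the actual architecture and tokenization in the paper are not reproduced here; the taxon names and mask token below are purely illustrative):

```python
def make_masked_examples(sample, mask_token="[MASK]"):
    """Turn one microbiome sample (a list of taxon IDs, analogous to a
    sentence of words) into self-supervised training pairs: each pair
    masks one taxon and asks the model to recover it from the remaining
    community context. No external labels are needed."""
    examples = []
    for i in range(len(sample)):
        masked = list(sample)          # copy so the original sample is untouched
        target = masked[i]             # the taxon to be predicted
        masked[i] = mask_token         # hide it from the model's input
        examples.append((masked, target))
    return examples

# A toy gut-microbiome "sentence": taxa observed in one subject.
sample = ["Bacteroides", "Faecalibacterium", "Roseburia"]
for context, target in make_masked_examples(sample):
    print(context, "->", target)
```

A model trained on many such pairs learns which taxa tend to co-occur, which is what yields the contextualized taxon representations: the same taxon receives a different embedding depending on the other taxa present in the sample.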


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0654/12058177/4800b16807d8/pcbi.1011353.g001.jpg
