

Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data.

Author Information

Pope Quintin, Varma Rohan, Tataru Christine, David Maude M, Fern Xiaoli

Affiliations

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America.

Department of Pathology, Brigham and Women's Hospital, Boston, Massachusetts, United States of America.

Publication Information

PLoS Comput Biol. 2025 May 7;21(5):e1011353. doi: 10.1371/journal.pcbi.1011353. eCollection 2025 May.

Abstract

We use open source human gut microbiome data to learn a microbial "language" model by adapting techniques from Natural Language Processing (NLP). Our microbial "language" model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Irritable Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.
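The self-supervised setup described above treats each microbiome sample as a "sentence" of taxa and, in the style of masked language modeling, hides individual taxa so the model must predict them from the surrounding community context. As a minimal illustration of how such training pairs can be constructed (the actual architecture and tokenization in the paper are not reproduced here; the taxon names and mask token below are purely illustrative):

```python
def make_masked_examples(sample, mask_token="[MASK]"):
    """Turn one microbiome sample (a list of taxon IDs, analogous to a
    sentence of words) into self-supervised training pairs: each pair
    masks one taxon and asks the model to recover it from the remaining
    community context. No external labels are needed."""
    examples = []
    for i in range(len(sample)):
        masked = list(sample)          # copy so the original sample is untouched
        target = masked[i]             # the taxon to be predicted
        masked[i] = mask_token         # hide it from the model's input
        examples.append((masked, target))
    return examples

# A toy gut-microbiome "sentence": taxa observed in one subject.
sample = ["Bacteroides", "Faecalibacterium", "Roseburia"]
for context, target in make_masked_examples(sample):
    print(context, "->", target)
```

A model trained on many such pairs learns which taxa tend to co-occur, which is what yields the contextualized taxon representations: the same taxon receives a different embedding depending on the other taxa present in the sample.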


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0654/12058177/4800b16807d8/pcbi.1011353.g001.jpg
