Yan Binghao, Nam Yunbi, Li Lingyao, Deek Rebecca A, Li Hongzhe, Ma Siyuan
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States.
Front Genet. 2025 Jan 7;15:1494474. doi: 10.3389/fgene.2024.1494474. eCollection 2024.
Recent advancements in deep learning, particularly large language models (LLMs), have had a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.