Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China.
Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA.
Nat Commun. 2024 Oct 30;15(1):9392. doi: 10.1038/s41467-024-53759-4.
Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of the model, including prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, it generates de novo sequences of up to 96 K base pairs in length, which contain potential regulatory elements and annotated proteins with phage-related functions.
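The abstract mentions two concrete mechanics: nucleotide-level tokenization and autoregressive generation of long sequences. The following minimal Python sketch illustrates those two ideas only; the vocabulary, token ids, function names, and the random-logit placeholder model are illustrative assumptions and do not reflect the published megaDNA code or API.

    import numpy as np

    # Illustrative nucleotide-level vocabulary; token ids are assumptions,
    # not megaDNA's actual encoding.
    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "<bos>": 4, "<eos>": 5}
    INV_VOCAB = {i: s for s, i in VOCAB.items()}

    def tokenize(seq: str) -> list[int]:
        """Map a DNA string to nucleotide-level token ids."""
        return [VOCAB["<bos>"]] + [VOCAB[b] for b in seq.upper()] + [VOCAB["<eos>"]]

    def placeholder_logits(context: list[int]) -> np.ndarray:
        """Stand-in for a trained model's next-token logits.
        A real long-context multiscale transformer would condition on the
        full nucleotide context; here we return arbitrary values."""
        rng = np.random.default_rng(len(context))
        return rng.normal(size=len(VOCAB))

    def generate(prompt: str, max_new_tokens: int = 50, temperature: float = 1.0) -> str:
        """Autoregressive nucleotide sampling loop (generic sketch)."""
        tokens = tokenize(prompt)[:-1]  # drop <eos> so generation can continue
        rng = np.random.default_rng(0)
        for _ in range(max_new_tokens):
            logits = placeholder_logits(tokens) / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            next_id = int(rng.choice(len(VOCAB), p=probs))
            if next_id == VOCAB["<eos>"]:
                break
            tokens.append(next_id)
        # Keep only A/C/G/T when printing the generated sequence.
        return "".join(INV_VOCAB[t] for t in tokens if t < 4)

    if __name__ == "__main__":
        print(generate("ATGGCT", max_new_tokens=20))

With a trained genome language model in place of placeholder_logits, the same loop would extend a nucleotide prompt one base at a time, which is how long de novo sequences can be sampled.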