A long-context language model for deciphering and generating bacteriophage genomes.

Affiliations

Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China.

Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA.

Publication information

Nat Commun. 2024 Oct 30;15(1):9392. doi: 10.1038/s41467-024-53759-4.

Abstract

Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
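As an illustration of what nucleotide-level tokenization means in practice, the following minimal sketch maps each base of a sequence to one integer token. The vocabulary, the token IDs, and the function names `encode` and `decode` are assumptions made for this example; they are not taken from megaDNA's published code.

```python
# Hypothetical vocabulary for illustration only; not the megaDNA vocabulary.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
INVERSE_VOCAB = {token_id: symbol for symbol, token_id in VOCAB.items()}


def encode(sequence: str) -> list[int]:
    """Map every nucleotide to one token ID; unrecognized symbols become N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


def decode(token_ids: list[int]) -> str:
    """Map token IDs back to nucleotides, skipping padding tokens."""
    pad_id = VOCAB["<pad>"]
    return "".join(INVERSE_VOCAB[i] for i in token_ids if i != pad_id)


tokens = encode("ATGCGTNA")
print(tokens)
print(decode(tokens))
```

With one token per base, a 96 K base pair genome becomes a sequence of roughly 96,000 tokens, which is why a long-context architecture is needed.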

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3dd1/11525655/d4673651d333/41467_2024_53759_Fig1_HTML.jpg
