Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China.
Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA.
Nat Commun. 2024 Oct 30;15(1):9392. doi: 10.1038/s41467-024-53759-4.
Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of the model, including prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, it generates de novo sequences of up to 96 K base pairs in length, which contain potential regulatory elements and annotated proteins with phage-related functions.
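The abstract mentions two concrete mechanics: nucleotide-level tokenization and autoregressive generation of long sequences. The following minimal Python sketch illustrates those two ideas only; the vocabulary, token ids, function names, and the random-logit placeholder model are illustrative assumptions and do not reflect the published megaDNA code or API.

    import numpy as np

    # Illustrative nucleotide-level vocabulary; token ids are assumptions,
    # not megaDNA's actual encoding.
    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "<bos>": 4, "<eos>": 5}
    INV_VOCAB = {i: s for s, i in VOCAB.items()}

    def tokenize(seq: str) -> list[int]:
        """Map a DNA string to nucleotide-level token ids."""
        return [VOCAB["<bos>"]] + [VOCAB[b] for b in seq.upper()] + [VOCAB["<eos>"]]

    def placeholder_logits(context: list[int]) -> np.ndarray:
        """Stand-in for a trained model's next-token logits.
        A real long-context multiscale transformer would condition on the
        full nucleotide context; here we return arbitrary values."""
        rng = np.random.default_rng(len(context))
        return rng.normal(size=len(VOCAB))

    def generate(prompt: str, max_new_tokens: int = 50, temperature: float = 1.0) -> str:
        """Autoregressive nucleotide sampling loop (generic sketch)."""
        tokens = tokenize(prompt)[:-1]  # drop <eos> so generation can continue
        rng = np.random.default_rng(0)
        for _ in range(max_new_tokens):
            logits = placeholder_logits(tokens) / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            next_id = int(rng.choice(len(VOCAB), p=probs))
            if next_id == VOCAB["<eos>"]:
                break
            tokens.append(next_id)
        # Keep only A/C/G/T when printing the generated sequence.
        return "".join(INV_VOCAB[t] for t in tokens if t < 4)

    if __name__ == "__main__":
        print(generate("ATGGCT", max_new_tokens=20))

With a trained genome language model in place of placeholder_logits, the same loop would extend a nucleotide prompt one base at a time, which is how long de novo sequences can be sampled.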