Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences.

Author information

Jeremy Ratcliff

Affiliation

Johns Hopkins University Applied Physics Laboratory, 11000 Johns Hopkins Road, Laurel, MD 20723, USA.

Publication information

NAR Genom Bioinform. 2024 Sep 18;6(3):lqae129. doi: 10.1093/nargab/lqae129. eCollection 2024 Sep.

DOI:10.1093/nargab/lqae129

PMID:39296932

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11409064/

Abstract

Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (n = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. While the results are specific to the megaDNA model, the evaluated framework described here could be applied to any generative model for genomic sequences.
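
As an illustration of the kind of analysis the abstract describes, the sketch below computes two global compositional metrics for a sequence and the sensitivity and specificity of a detector. It is a minimal sketch under stated assumptions, not the paper's pipeline: the abstract does not list which metrics were used, so GC content and dinucleotide observed/expected ratios are assumed here as examples, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A/C/G/T) that are G or C."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return (counts["G"] + counts["C"]) / total if total else 0.0


def dinucleotide_odds_ratios(seq: str) -> dict[str, float]:
    """Observed/expected frequency ratio for each of the 16 dinucleotides.

    Assumption: the expected frequency of a dinucleotide is the product of
    the frequencies of its two bases. Windows containing an ambiguous base
    (e.g. N) are ignored.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in "ACGT")
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    valid = ["".join(p) for p in product("ACGT", repeat=2)]
    n_mono = sum(mono.values())
    n_di = sum(di[d] for d in valid)
    ratios = {}
    for d in valid:
        expected = (mono[d[0]] / n_mono) * (mono[d[1]] / n_mono) if n_mono else 0.0
        observed = di[d] / n_di if n_di else 0.0
        ratios[d] = observed / expected if expected else 0.0
    return ratios


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity), with label 1 (synthetic) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec
```

In a workflow like the one summarized above, per-genome feature vectors built from such metrics would be compared between the natural and synthetic sets (for example with rank-sum tests) and passed to a classifier; sensitivity here is the fraction of synthetic genomes correctly flagged, and specificity is the fraction of natural genomes correctly passed.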


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fac/11409064/b2820a400b8e/lqae129fig1.jpg
