Ratcliff Jeremy
Johns Hopkins University Applied Physics Laboratory, 11000 Johns Hopkins Road, Laurel, MD 20723, USA.
NAR Genom Bioinform. 2024 Sep 18;6(3):lqae129. doi: 10.1093/nargab/lqae129. eCollection 2024 Sep.
Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (n = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. While the results are specific to the megaDNA model, the evaluation framework described here could be applied to any generative model for genomic sequences.
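As a rough illustration of what "global compositional metrics" and the reported sensitivity and specificity could look like in practice, the sketch below computes two common composition measures for a nucleotide sequence and scores a binary detector. The abstract does not list the metrics actually used, so GC content and dinucleotide observed/expected ratios are assumed examples here, and all function names are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (assumed example metric)."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in BASES)
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def dinucleotide_odds_ratios(seq: str) -> dict:
    """Observed/expected frequency ratio for each of the 16 dinucleotides.

    Expected frequency is the product of the two mononucleotide frequencies.
    Pairs containing ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    ratios = {}
    for x, y in product(BASES, repeat=2):
        if n_mono == 0 or n_pairs == 0 or mono[x] == 0 or mono[y] == 0:
            ratios[x + y] = float("nan")  # undefined when a base is absent
            continue
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = pairs[x + y] / n_pairs
        ratios[x + y] = observed / expected
    return ratios


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity, with 1 = synthetic and 0 = natural."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

Per-genome values such as these could then be compared between the natural and synthetic sets with a rank-sum test, or assembled into a feature vector for a classifier. Which features and which classifier architecture the study used are not stated in the abstract.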