
Generative language models on nucleotide sequences of human genes.

Affiliations

Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey.

Publication information

Sci Rep. 2024 Sep 27;14(1):22204. doi: 10.1038/s41598-024-72512-x.

Abstract

Language models, especially transformer-based ones, have achieved colossal success in natural language processing: BERT for natural language understanding and GPT-3 for natural language generation are prominent examples. If we regard a DNA sequence as a text written in a four-letter alphabet representing the nucleotides, it is structurally similar to a natural language. This similarity has led to discriminative language models such as DNABERT in DNA-related bioinformatics; to our knowledge, however, the generative side of the coin remains largely unexplored. We therefore focused on developing an autoregressive generative language model, in the style of GPT-3, for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we conducted our study at a smaller scale, on the nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than on whole DNA. This decision does not significantly change the structure of the problem, as both DNA and genes can be treated as one-dimensional sequences over four nucleotides without losing much information and without oversimplification. We systematically studied this almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. A further benefit was learning how to work with generative models on languages we do not understand, unlike natural languages; here, evaluation on real-world tasks beyond classical metrics such as perplexity proved important. In addition, we examined whether the data-hungry nature of these models changes when the vocabulary is minimal (four symbols, one per nucleotide type), on the hypothesis that such a language might make the problem easier. However, we found that this did not substantially reduce the amount of data required.
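The abstract's "simple techniques such as N-grams" baseline can be illustrated with a minimal character-level N-gram language model over the four-nucleotide alphabet. This is a sketch under stated assumptions, not the authors' implementation: the function names, the add-alpha smoothing, and the `^` start-padding symbol are all choices made here for illustration.

```python
import math
import random
from collections import defaultdict

NUCLEOTIDES = "ACGT"

def train_ngram(sequences, n=3):
    """Count n-gram statistics over nucleotide strings.

    Returns a dict mapping each (n-1)-length context to per-nucleotide
    counts. Sequences are left-padded with '^' so the first real
    nucleotides also have a context.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = "^" * (n - 1) + seq
        for i in range(len(seq)):
            counts[padded[i:i + n - 1]][padded[i + n - 1]] += 1
    return counts

def prob(counts, context, nucleotide, alpha=1.0):
    """Add-alpha smoothed P(nucleotide | context)."""
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + alpha * len(NUCLEOTIDES)
    return (ctx.get(nucleotide, 0) + alpha) / total

def perplexity(counts, seq, n=3, alpha=1.0):
    """Per-nucleotide perplexity of `seq` under the model."""
    padded = "^" * (n - 1) + seq
    log_prob = 0.0
    for i in range(len(seq)):
        log_prob += math.log(prob(counts, padded[i:i + n - 1],
                                  padded[i + n - 1], alpha))
    return math.exp(-log_prob / len(seq))

def generate(counts, length, n=3, alpha=1.0, rng=None):
    """Sample a nucleotide sequence autoregressively, one symbol at a time."""
    rng = rng or random.Random(0)
    out = "^" * (n - 1)
    for _ in range(length):
        context = out[-(n - 1):]
        weights = [prob(counts, context, nt, alpha) for nt in NUCLEOTIDES]
        out += rng.choices(NUCLEOTIDES, weights=weights)[0]
    return out[n - 1:]
```

With a vocabulary of only four symbols, even a trigram model (`n=3`) has at most 16 contexts over real nucleotides, which makes the abstract's question concrete: the model is tiny, yet, as the study reports, the amount of training data required does not shrink accordingly.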

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/448b22846186/41598_2024_72512_Fig1_HTML.jpg
