
Generative language models on nucleotide sequences of human genes.

Affiliations

Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey.

Publication Information

Sci Rep. 2024 Sep 27;14(1):22204. doi: 10.1038/s41598-024-72512-x.

DOI: 10.1038/s41598-024-72512-x
PMID: 39333252
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11437190/
Abstract

Language models, especially transformer-based ones, have achieved colossal success in natural language processing. To be precise, studies like BERT for natural language understanding and works like GPT-3 for natural language generation are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than the whole DNA. This decision has not significantly changed the structure of the problem, as both DNA and genes can be considered as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First of all, we systematically studied an almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural languages. The importance of using real-world tasks beyond classical metrics such as perplexity was noted. In addition, we examined whether the data-hungry nature of these models can be altered by selecting a language with minimal vocabulary size, four due to four different types of nucleotides. The reason for reviewing this was that choosing such a language might make the problem easier. However, in this study, we found that this did not change the amount of data required very much.
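
The abstract describes treating gene sequences as text over the four-letter nucleotide alphabet, comparing simple baselines such as N-grams with RNNs, and judging models by perplexity alongside real-world tasks. The sketch below is only an illustration of that framing, not the authors' implementation: it trains an add-one-smoothed character-level N-gram model on synthetic sequences, reports per-nucleotide perplexity on held-out data, and samples new sequences autoregressively. All function names and parameters are invented for this example.

```python
# Minimal sketch (not the paper's code): a character-level N-gram language model
# over the nucleotide alphabet {A, C, G, T}, with add-one smoothing,
# held-out perplexity, and autoregressive sampling. Data below is synthetic.
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"
N = 4  # N-gram order (context length = N - 1)

def train_ngram(sequences, n=N):
    """Count context -> next-nucleotide frequencies from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = "^" * (n - 1) + seq  # pad so every position has a full context
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
    return counts

def next_probs(counts, context):
    """Add-one smoothed distribution over the next nucleotide given a context."""
    c = counts.get(context, Counter())
    total = sum(c.values()) + len(ALPHABET)
    return {a: (c[a] + 1) / total for a in ALPHABET}

def perplexity(counts, sequences, n=N):
    """Per-nucleotide perplexity of held-out sequences under the model."""
    log_prob, tokens = 0.0, 0
    for seq in sequences:
        padded = "^" * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            log_prob += math.log(next_probs(counts, padded[i - n + 1:i])[padded[i]])
            tokens += 1
    return math.exp(-log_prob / tokens)

def generate(counts, length=60, n=N, seed=0):
    """Sample a nucleotide sequence one base at a time (autoregressively)."""
    rng = random.Random(seed)
    out = "^" * (n - 1)
    for _ in range(length):
        probs = next_probs(counts, out[-(n - 1):])
        out += rng.choices(ALPHABET, weights=[probs[a] for a in ALPHABET])[0]
    return out[n - 1:]

if __name__ == "__main__":
    rng = random.Random(42)
    data = ["".join(rng.choice(ALPHABET) for _ in range(200)) for _ in range(50)]
    model = train_ngram(data[:40])
    print("held-out perplexity:", round(perplexity(model, data[40:]), 3))
    print("sample:", generate(model))
```

On uniformly random synthetic data the perplexity stays close to 4 (the vocabulary size); on real gene sequences, a lower value would indicate that local nucleotide statistics are being captured, which is the kind of comparison the paper draws among N-grams, RNNs, and transformer-style generative models.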


Figures (PMC11437190):
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/448b22846186/41598_2024_72512_Fig1_HTML.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/47ec02fe0231/41598_2024_72512_Fig2_HTML.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/21b47c6b1ce5/41598_2024_72512_Fig3_HTML.jpg
Fig 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/e62f7d113a4d/41598_2024_72512_Fig4_HTML.jpg
Fig 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/50a82d8896e0/41598_2024_72512_Fig5_HTML.jpg
Fig 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/95ebd90c6c2e/41598_2024_72512_Fig6_HTML.jpg
Fig 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/943fbac239b4/41598_2024_72512_Fig7_HTML.jpg
Fig 8: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a650/11437190/680af8ea4d26/41598_2024_72512_Fig8_HTML.jpg

Similar Articles

1. Generative language models on nucleotide sequences of human genes.
   Sci Rep. 2024 Sep 27;14(1):22204. doi: 10.1038/s41598-024-72512-x.
2. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.
   Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab005.
3. BioGPT: generative pre-trained transformer for biomedical text generation and mining.
   Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac409.
4. A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records.
   Headache. 2024 Apr;64(4):400-409. doi: 10.1111/head.14702. Epub 2024 Mar 25.
5. Distinguishing word identity and sequence context in DNA language models.
   BMC Bioinformatics. 2024 Sep 13;25(1):301. doi: 10.1186/s12859-024-05869-5.
6. Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study.
   JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
7. Generative large language models are all-purpose text analytics engines: text-to-text learning is all you need.
   J Am Med Inform Assoc. 2024 Sep 1;31(9):1892-1903. doi: 10.1093/jamia/ocae078.
8. Molecular language models: RNNs or transformer?
   Brief Funct Genomics. 2023 Jul 17;22(4):392-400. doi: 10.1093/bfgp/elad012.
9. Effect of tokenization on transformers for biological sequences.
   Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
10. Crystal Composition Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials.
    Adv Sci (Weinh). 2024 Sep;11(36):e2304305. doi: 10.1002/advs.202304305. Epub 2024 Aug 5.

Cited By

1. Regulating genome language models: navigating policy challenges at the intersection of AI and genetics.
   Hum Genet. 2025 Sep 16. doi: 10.1007/s00439-025-02768-4.

References Cited in This Article

1. Species-aware DNA language models capture regulatory elements and their evolution.
   Genome Biol. 2024 Apr 2;25(1):83. doi: 10.1186/s13059-024-03221-x.
2. AIPs-SnTCN: Predicting Anti-Inflammatory Peptides Using fastText and Transformer Encoder-Based Hybrid Word Embedding with Self-Normalized Temporal Convolutional Networks.
   J Chem Inf Model. 2023 Nov 13;63(21):6537-6554. doi: 10.1021/acs.jcim.3c01563. Epub 2023 Oct 31.
3. DNA language models are powerful predictors of genome-wide variant effects.
   Proc Natl Acad Sci U S A. 2023 Oct 31;120(44):e2311219120. doi: 10.1073/pnas.2311219120. Epub 2023 Oct 26.
4. Evaluating native-like structures of RNA-protein complexes through the deep learning method.
   Nat Commun. 2023 Feb 24;14(1):1060. doi: 10.1038/s41467-023-36720-9.
5. BioGPT: generative pre-trained transformer for biomedical text generation and mining.
   Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac409.
6. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model.
   Artif Intell Med. 2022 Sep;131:102349. doi: 10.1016/j.artmed.2022.102349. Epub 2022 Jul 6.
7. ProteinBERT: a universal deep-learning model of protein sequence and function.
   Bioinformatics. 2022 Apr 12;38(8):2102-2110. doi: 10.1093/bioinformatics/btac020.
8. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models.
   Nucleic Acids Res. 2021 Dec 16;49(22):e129. doi: 10.1093/nar/gkab829.
9. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.
   Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.
10. Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA.
    Front Bioeng Biotechnol. 2020 Sep 4;8:1032. doi: 10.3389/fbioe.2020.01032. eCollection 2020.