

Are genomic language models all you need? Exploring genomic language models on protein downstream tasks.

Affiliations

InstaDeep, Cambridge, MA 02142, United States.

InstaDeep, Paris 75010, France.

Publication information

Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae529.

DOI: 10.1093/bioinformatics/btae529
PMID: 39212609
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11399231/
Abstract

MOTIVATION

Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, due to few tasks pairing proteins with the coding DNA sequences (CDS) that can be processed by gLMs.
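The pairing the authors describe rests on the central dogma: every protein has a coding DNA sequence (CDS) whose codons map to its amino acids, which is what lets a gLM consume the DNA side of a protein task. A minimal sketch of that mapping, using a toy subset of the standard genetic code (the table and function names here are illustrative, not from the paper's released code):

```python
# Toy subset of the standard genetic code; a real table has 64 codons.
CODON_TABLE = {
    "ATG": "M",  # methionine (also the start codon)
    "GCC": "A",  # alanine
    "ATT": "I",  # isoleucine
    "GTA": "V",  # valine
}

def translate(cds: str) -> str:
    """Translate a codon-aligned CDS into its amino-acid sequence."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))

print(translate("ATGGCCATTGTA"))  # MAIV
```

Because this mapping is many-to-one (synonymous codons encode the same amino acid), a CDS carries strictly more information than the protein it encodes, which is one reason gLM and pLM representations can differ.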

RESULTS

In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive and even outperform their pLM counterparts on some tasks. The best performance was achieved using retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
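The 3mer-versus-6mer comparison above comes down to how a CDS is split into tokens. A minimal sketch of non-overlapping k-mer tokenization (the function name is illustrative; the paper's actual tokenizer may differ in details such as overlap or special tokens):

```python
def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers, dropping any trailing remainder."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

cds = "ATGGCCATTGTA"
print(kmer_tokenize(cds, 3))  # ['ATG', 'GCC', 'ATT', 'GTA'] -- each token is one codon
print(kmer_tokenize(cds, 6))  # ['ATGGCC', 'ATTGTA'] -- each token spans two codons
```

With 3mer tokenization each token aligns with exactly one codon, and hence one amino acid, which is a plausible reason it transfers better to protein tasks than 6mers, where each token straddles two codons.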

AVAILABILITY AND IMPLEMENTATION

We make our inference code, 3mer pre-trained model weights and datasets available.


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/c940eef4c590/btae529f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/dc6d2c6f065a/btae529f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/b6af25e221c0/btae529f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/b73064838e1c/btae529f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/45b11aadbeb0/btae529f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e9f/11399231/d66ca2b7eb5b/btae529f6.jpg

Similar articles

1. Are genomic language models all you need? Exploring genomic language models on protein downstream tasks.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae529.
2. Evaluating the representational power of pre-trained DNA language models for regulatory genomics.
bioRxiv. 2024 Sep 25:2024.02.29.582810. doi: 10.1101/2024.02.29.582810.
3. Democratizing protein language models with parameter-efficient fine-tuning.
Proc Natl Acad Sci U S A. 2024 Jun 25;121(26):e2405840121. doi: 10.1073/pnas.2405840121. Epub 2024 Jun 20.
4. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.
5. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications.
J Cheminform. 2024 Aug 2;16(1):92. doi: 10.1186/s13321-024-00884-3.
6. Protein language models meet reduced amino acid alphabets.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae061.
7. Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad617.
8. A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
9. Effect of tokenization on transformers for biological sequences.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
10. A self-supervised language model selection strategy for biomedical question answering.
J Biomed Inform. 2023 Oct;146:104486. doi: 10.1016/j.jbi.2023.104486. Epub 2023 Sep 16.

Cited by

1. Regulating genome language models: navigating policy challenges at the intersection of AI and genetics.
Hum Genet. 2025 Sep 16. doi: 10.1007/s00439-025-02768-4.
2. Creating interpretable deep learning models to identify species using environmental DNA sequences.
Sci Rep. 2025 Jul 28;15(1):27436. doi: 10.1038/s41598-025-09846-7.
