
PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications.

Author Information

Tan Yang, Li Mingchen, Zhou Ziyi, Tan Pan, Yu Huiqun, Fan Guisheng, Hong Liang

Affiliations

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.

Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China.

Publication Information

J Cheminform. 2024 Aug 2;16(1):92. doi: 10.1186/s13321-024-00884-3.

DOI: 10.1186/s13321-024-00884-3
PMID: 39095917
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11297785/
Abstract

Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining. SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
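
To make the tokenization idea concrete, below is a minimal sketch that trains both vocabulary types described in the abstract on a toy protein corpus, using the Hugging Face tokenizers library. The corpus, vocabulary size, and special tokens here are illustrative assumptions, not the paper's actual pretraining setup; the authors' real pipeline is in the linked repository and may differ.

```python
# Minimal sketch: sub-word vocabularies for protein sequences via BPE and
# unigram, in the spirit of the paper. Assumes the Hugging Face `tokenizers`
# package; the toy corpus is illustrative, not the paper's training data.
from tokenizers import Tokenizer, models, trainers

corpus = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIE",
    "MKKLVLSLSLVLAFSSATAAFAAIPQNIRIGTD",
] * 100  # repeat so residue-pair statistics are non-trivial

# BPE: start from single residues and greedily merge the most frequent
# adjacent pairs until the vocabulary reaches the target size (50 and 200
# were the best-performing sizes reported in the abstract).
bpe = Tokenizer(models.BPE(unk_token="<unk>"))
bpe.train_from_iterator(
    corpus,
    trainer=trainers.BpeTrainer(vocab_size=50, special_tokens=["<unk>", "<pad>"]),
)

# Unigram: start from a large candidate token set and iteratively prune the
# tokens that contribute least to the corpus likelihood.
uni = Tokenizer(models.Unigram())
uni.train_from_iterator(
    corpus,
    trainer=trainers.UnigramTrainer(
        vocab_size=50, special_tokens=["<unk>"], unk_token="<unk>"
    ),
)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print("BPE    :", bpe.encode(seq).tokens)  # multi-residue tokens, e.g. "IE"
print("Unigram:", uni.encode(seq).tokens)
```

A sequence tokenized this way feeds into PLM pretraining just like character-level input, but with shorter token streams and a vocabulary that captures frequently co-occurring residues.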

Figures (PMC full text):
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c880/11297785/75cefd877eac/13321_2024_884_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c880/11297785/f2d387a20c73/13321_2024_884_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c880/11297785/3efe1218a445/13321_2024_884_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c880/11297785/27d7aab67f62/13321_2024_884_Fig4_HTML.jpg

Similar Articles

1. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications.
J Cheminform. 2024 Aug 2;16(1):92. doi: 10.1186/s13321-024-00884-3.
2. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.
J Chem Inf Model. 2021 Apr 26;61(4):1560-1569. doi: 10.1021/acs.jcim.0c01127. Epub 2021 Mar 14.
3. Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties.
Mol Inform. 2024 Mar;43(3):e202300249. doi: 10.1002/minf.202300249. Epub 2024 Jan 23.
4. Effect of tokenization on transformers for biological sequences.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
5. Are genomic language models all you need? Exploring genomic language models on protein downstream tasks.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae529.
6. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization.
J Cheminform. 2023 May 29;15(1):55. doi: 10.1186/s13321-023-00725-9.
7. Simple, Efficient, and Scalable Structure-Aware Adapter Boosts Protein Language Models.
J Chem Inf Model. 2024 Aug 26;64(16):6338-6349. doi: 10.1021/acs.jcim.4c00689. Epub 2024 Aug 7.
8. Protein language models meet reduced amino acid alphabets.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae061.
9. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.
10. S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure.
bioRxiv. 2024 May 13:2023.08.06.552203. doi: 10.1101/2023.08.06.552203.

Cited By

1. Genome language modeling (GLM): a beginner's cheat sheet.
Biol Methods Protoc. 2025 Mar 25;10(1):bpaf022. doi: 10.1093/biomethods/bpaf022. eCollection 2025.
2. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability.
Elife. 2025 May 2;13:RP98033. doi: 10.7554/eLife.98033.
3. AI-enabled alkaline-resistant evolution of protein to apply in mass production.

References

1. Protein Engineering with Lightweight Graph Denoising Neural Networks.
J Chem Inf Model. 2024 May 13;64(9):3650-3661. doi: 10.1021/acs.jcim.4c00036. Epub 2024 Apr 17.
2. Convolutions are competitive with transformers for protein sequence pretraining.
Cell Syst. 2024 Mar 20;15(3):286-294.e2. doi: 10.1016/j.cels.2024.01.008. Epub 2024 Feb 29.
3. Protein language models meet reduced amino acid alphabets.
Elife. 2025 Feb 19;13:RP102788. doi: 10.7554/eLife.102788.
4. Protein engineering in the deep learning era.
mLife. 2024 Dec 26;3(4):477-491. doi: 10.1002/mlf2.12157. eCollection 2024 Dec.
5. ProGen2: Exploring the boundaries of protein language models.
Cell Syst. 2023 Nov 15;14(11):968-978.e3. doi: 10.1016/j.cels.2023.10.002. Epub 2023 Oct 30.
6. Masked inverse folding with sequence transfer for protein representation learning.
Protein Eng Des Sel. 2023 Jan 21;36. doi: 10.1093/protein/gzad015.
7. Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.
8. SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering.
J Cheminform. 2023 Feb 3;15(1):12. doi: 10.1186/s13321-023-00688-x.
9. Large language models generate functional protein sequences across diverse families.
Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2. Epub 2023 Jan 26.
10. Light attention predicts protein location from the language of life.
Bioinform Adv. 2021 Nov 19;1(1):vbab035. doi: 10.1093/bioadv/vbab035. eCollection 2021.
11. SoluProtMut: A manually curated database of protein solubility changes upon mutations.
Comput Struct Biotechnol J. 2022 Nov 9;20:6339-6347. doi: 10.1016/j.csbj.2022.11.009. eCollection 2022.