PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications.

Authors

Tan Yang, Li Mingchen, Zhou Ziyi, Tan Pan, Yu Huiqun, Fan Guisheng, Hong Liang

Affiliations

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.

Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China.

Publication information

J Cheminform. 2024 Aug 2;16(1):92. doi: 10.1186/s13321-024-00884-3.

DOI:10.1186/s13321-024-00884-3

PMID:39095917

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11297785/

Abstract

Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining.

SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
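To make the idea of merging frequently co-occurring residues into single tokens concrete, the following is a minimal sketch of byte-pair encoding applied to amino-acid strings. It is an illustration only and not the authors' implementation: the function names (`train_bpe`, `merge_pair`, `tokenize`), the toy sequences, and the target vocabulary size of 16 are all assumptions chosen for the example, and the unigram method mentioned in the abstract is not shown.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each non-overlapping occurrence of (left, right) with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, vocab_size):
    """Learn BPE merges over residue sequences until the vocabulary reaches vocab_size."""
    # Start from single residues, as in a plain amino-acid vocabulary.
    corpus = [list(seq) for seq in sequences]
    vocab = sorted({tok for seq in corpus for tok in seq})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair first; ties are broken lexicographically for determinism.
        (left, right), count = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append((left, right))
        if left + right not in vocab:
            vocab.append(left + right)
        corpus = [merge_pair(seq, left, right) for seq in corpus]
    return vocab, merges


def tokenize(sequence, merges):
    """Apply the learned merges, in training order, to a new sequence."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


# Toy sequences invented for illustration; they are not taken from the PETA datasets.
toy_sequences = ["MKTAYIAKQR", "MKTAYLAKQW", "GGMKTAYGG"]
vocab, merges = train_bpe(toy_sequences, vocab_size=16)
print(merges)
print(tokenize("MKTAYIAK", merges))
```

On real data the same procedure would run over a large pre-training corpus, with the target vocabulary size as the main knob; a maintained tokenizer library would normally be preferable to a hand-written loop like this one.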
