
PeptideBERT: A Language Model Based on Transformers for Peptide Property Prediction.

Affiliations

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.

Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.

Publication

J Phys Chem Lett. 2023 Nov 23;14(46):10427-10434. doi: 10.1021/acs.jpclett.3c02398. Epub 2023 Nov 13.

DOI: 10.1021/acs.jpclett.3c02398
PMID: 37956397
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10683064/
Abstract

Recent advances in language models have provided the protein modeling community with a powerful tool that uses transformers to represent protein sequences as text. This breakthrough enables sequence-to-property prediction for peptides without relying on explicit structural data. Inspired by the recent progress in the field of large language models, we present PeptideBERT, a protein language model specifically tailored for predicting essential peptide properties such as hemolysis, solubility, and nonfouling. PeptideBERT utilizes the pretrained ProtBERT transformer model with 12 attention heads and 12 hidden layers. Through fine-tuning the pretrained model for the three downstream tasks, our model is state of the art (SOTA) in predicting hemolysis, which is crucial for determining a peptide's potential to lyse red blood cells, as well as nonfouling properties. Leveraging primarily shorter sequences and a data set whose negative samples are predominantly associated with insoluble peptides, our model showcases remarkable performance.

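The setup the abstract describes (pretrained ProtBERT with 12 layers and 12 attention heads, fine-tuned with a binary classification head for each of the three tasks) can be sketched with standard tooling. The peptide examples and labels below are hypothetical, and the commented HuggingFace calls are the usual sequence-classification pattern under the assumption that the `Rostlab/prot_bert` checkpoint is used; none of this is the authors' released code.

```python
# Minimal sketch of data preparation for ProtBERT-style fine-tuning.
# ProtBERT's convention: proteins are tokenized as space-separated
# uppercase residues, with rare amino acids (U, Z, O, B) mapped to X.

def format_for_protbert(sequence: str) -> str:
    """Convert a raw peptide string into ProtBERT's expected input format."""
    return " ".join("X" if aa in "UZOB" else aa for aa in sequence.upper())

# Hypothetical binary-labeled peptides for one task (hemolysis):
peptides = [
    ("GIGKFLHSAKKFGKAFVGEIMNS", 1),  # label 1: hemolytic (illustrative)
    ("KKKKKKKKKK", 0),               # label 0: non-hemolytic (illustrative)
]

texts = [format_for_protbert(seq) for seq, _ in peptides]
labels = [label for _, label in peptides]
print(texts[0][:11])  # -> "G I G K F L"

# With the transformers library installed, fine-tuning would follow the
# standard sequence-classification pattern (assumed, not verified here):
#
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
# model = AutoModelForSequenceClassification.from_pretrained(
#     "Rostlab/prot_bert", num_labels=2)  # one such model per task
# batch = tok(texts, padding=True, return_tensors="pt")
# loss = model(**batch, labels=torch.tensor(labels)).loss
```

One model is fine-tuned per property (hemolysis, solubility, nonfouling), since each task is an independent binary classification over the same sequence representation.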

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1482/10683064/294c21b90d6d/jz3c02398_0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1482/10683064/bdda15b57aac/jz3c02398_0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1482/10683064/bb2c79f7ec8d/jz3c02398_0002.jpg

Similar Articles

1. PeptideBERT: A Language Model Based on Transformers for Peptide Property Prediction.
J Phys Chem Lett. 2023 Nov 23;14(46):10427-10434. doi: 10.1021/acs.jpclett.3c02398. Epub 2023 Nov 13.
2. AMPDeep: hemolytic activity prediction of antimicrobial peptides using transfer learning.
BMC Bioinformatics. 2022 Sep 26;23(1):389. doi: 10.1186/s12859-022-04952-z.
3. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.
BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.
4. AggBERT: Best in Class Prediction of Hexapeptide Amyloidogenesis with a Semi-Supervised ProtBERT Model.
J Chem Inf Model. 2023 Sep 25;63(18):5727-5733. doi: 10.1021/acs.jcim.3c00817. Epub 2023 Aug 8.
5. A comparative study of pretrained language models for long clinical text.
J Am Med Inform Assoc. 2023 Jan 18;30(2):340-347. doi: 10.1093/jamia/ocac225.
6. Transformer-based deep learning for predicting protein properties in the life sciences.
Elife. 2023 Jan 18;12:e82819. doi: 10.7554/eLife.82819.
7. CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach.
ACS Biomater Sci Eng. 2022 Oct 10;8(10):4301-4310. doi: 10.1021/acsbiomaterials.2c00737. Epub 2022 Sep 23.
8. Automated ICD coding using extreme multi-label long text transformer-based models.
Artif Intell Med. 2023 Oct;144:102662. doi: 10.1016/j.artmed.2023.102662. Epub 2023 Sep 7.
9. Convolutions are competitive with transformers for protein sequence pretraining.
Cell Syst. 2024 Mar 20;15(3):286-294.e2. doi: 10.1016/j.cels.2024.01.008. Epub 2024 Feb 29.
10. AMMU: A survey of transformer-based biomedical pretrained language models.
J Biomed Inform. 2022 Feb;126:103982. doi: 10.1016/j.jbi.2021.103982. Epub 2021 Dec 31.

Cited By

1. MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models.
J Chem Inf Model. 2025 Sep 8;65(17):9049-9060. doi: 10.1021/acs.jcim.5c01625. Epub 2025 Aug 28.
2. BPA: a BERT-based priority annotation strategy for assessing the rationality of aquatic algal protein sequences.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf401.
3. Protein Structure-Function Relationship: A Kernel-PCA Approach for Reaction Coordinate Identification.
J Chem Theory Comput. 2025 Jul 22;21(14):7122-7130. doi: 10.1021/acs.jctc.5c00483. Epub 2025 Jul 14.
4. A review of transformer models in drug discovery and beyond.
J Pharm Anal. 2025 Jun;15(6):101081. doi: 10.1016/j.jpha.2024.101081. Epub 2024 Aug 30.
5. PepBERT: Lightweight language models for bioactive peptide representation.
bioRxiv. 2025 Jul 4:2025.04.08.647838. doi: 10.1101/2025.04.08.647838.
6. AI-Driven Antimicrobial Peptide Discovery: Mining and Generation.
Acc Chem Res. 2025 Jun 17;58(12):1831-1846. doi: 10.1021/acs.accounts.0c00594. Epub 2025 Jun 3.
7. Identifying the DNA methylation preference of transcription factors using ProtBERT and SVM.
PLoS Comput Biol. 2025 May 13;21(5):e1012513. doi: 10.1371/journal.pcbi.1012513. eCollection 2025 May.
8. Peptide Property Prediction for Mass Spectrometry Using AI: An Introduction to State of the Art Models.
Proteomics. 2025 May;25(9-10):e202400398. doi: 10.1002/pmic.202400398. Epub 2025 Apr 10.
9. NA_mCNN: Classification of Sodium Transporters in Membrane Proteins by Integrating Multi-Window Deep Learning and ProtTrans for Their Therapeutic Potential.
J Proteome Res. 2025 May 2;24(5):2324-2335. doi: 10.1021/acs.jproteome.4c00884. Epub 2025 Apr 7.
10. Leveraging large language models for peptide antibiotic design.
Cell Rep Phys Sci. 2025 Jan 15;6(1). doi: 10.1016/j.xcrp.2024.102359. Epub 2024 Dec 31.

References

1. GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models.
J Chem Inf Model. 2024 Feb 26;64(4):1134-1144. doi: 10.1021/acs.jcim.3c01706. Epub 2024 Feb 10.
2. Activity Map and Transition Pathways of G Protein-Coupled Receptor Revealed by Machine Learning.
J Chem Inf Model. 2023 Apr 24;63(8):2296-2304. doi: 10.1021/acs.jcim.3c00032. Epub 2023 Apr 10.
3. Serverless Prediction of Peptide Properties with Recurrent Neural Networks.
J Chem Inf Model. 2023 Apr 24;63(8):2546-2553. doi: 10.1021/acs.jcim.2c01317. Epub 2023 Apr 3.
4. Prediction of GPCR activity using machine learning.
Comput Struct Biotechnol J. 2022 May 18;20:2564-2573. doi: 10.1016/j.csbj.2022.05.016. eCollection 2022.
5. DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks.
Int J Mol Sci. 2021 Dec 17;22(24):13555. doi: 10.3390/ijms222413555.
6. Machine learning designs non-hemolytic antimicrobial peptides.
Chem Sci. 2021 Jun 7;12(26):9221-9232. doi: 10.1039/d1sc01713f. eCollection 2021 Jul 7.
7. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.
IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7112-7127. doi: 10.1109/TPAMI.2021.3095381. Epub 2022 Sep 14.
8. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations.
Nat Biomed Eng. 2021 Jun;5(6):613-623. doi: 10.1038/s41551-021-00689-x. Epub 2021 Mar 11.
9. SoluProt: prediction of soluble protein expression in Escherichia coli.
Bioinformatics. 2021 Apr 9;37(1):23-28. doi: 10.1093/bioinformatics/btaa1102.
10. AnOxPePred: using deep learning for the prediction of antioxidative properties of peptides.
Sci Rep. 2020 Dec 8;10(1):21471. doi: 10.1038/s41598-020-78319-w.