
PepBERT: Lightweight language models for bioactive peptide representation.

Authors

Du Zhenjiao, Caragea Doina, Guo Xiaolong, Li Yonghui

Affiliations

Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.

Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA.

Publication

bioRxiv. 2025 Jul 4:2025.04.08.647838. doi: 10.1101/2025.04.08.647838.

DOI: 10.1101/2025.04.08.647838
PMID: 40631260
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12236832/
Abstract

Protein language models (pLMs) have been widely adopted for various protein and peptide-related downstream tasks and demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model-PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters)-were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior to or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 out of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and value-added utilization of food processing by-products. The datasets, source codes, pretrained models, and tutorials for the usage of PepBERT are available at https://github.com/dzjxzyd/PepBERT.
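The abstract describes PepBERT as an encoder that turns a peptide sequence into a fixed-length representation consumed by downstream predictors. PepBERT's actual API is in the linked GitHub repository and is not reproduced here; the sketch below is purely illustrative (hypothetical embedding table, mean pooling, made-up dimensions) of what "encoding a peptide" means in this setting.

```python
import numpy as np

# Illustrative only: PepBERT's real encoder is a pretrained transformer; see
# https://github.com/dzjxzyd/PepBERT for the actual model and tutorials.
# This sketch shows the general idea: map each amino acid to an embedding
# vector, then pool over the sequence to get one fixed-length representation.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
EMBED_DIM = 64                        # illustrative; not PepBERT's actual size

rng = np.random.default_rng(0)
# Stand-in for a pretrained embedding table (a real pLM learns this during
# masked-language-model pretraining on peptide sequences).
embedding_table = rng.standard_normal((len(AMINO_ACIDS), EMBED_DIM))

def encode_peptide(sequence: str) -> np.ndarray:
    """Return a fixed-length vector for a peptide via mean pooling."""
    indices = [AMINO_ACIDS.index(aa) for aa in sequence.upper()]
    token_vectors = embedding_table[indices]  # shape (seq_len, EMBED_DIM)
    return token_vectors.mean(axis=0)         # shape (EMBED_DIM,)

vec = encode_peptide("GLFDIVKKV")  # a short peptide, well under 50 residues
print(vec.shape)  # (64,)
```

The fixed-length vector is what a downstream classifier (e.g., a bioactivity predictor) would take as input; PepBERT's contribution is producing such representations from a compact model pretrained specifically on short peptides rather than full-length proteins.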


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4e9/12236832/2897f8bc0c8d/nihpp-2025.04.08.647838v2-f0001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4e9/12236832/34483a0dded5/nihpp-2025.04.08.647838v2-f0002.jpg

