Du Zhenjiao, Caragea Doina, Guo Xiaolong, Li Yonghui
Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.
Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA.
bioRxiv. 2025 Jul 4:2025.04.08.647838. doi: 10.1101/2025.04.08.647838.
Protein language models (pLMs) have been widely adopted for protein- and peptide-related downstream tasks and have demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model, PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters), were pretrained from scratch on four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior or comparable to the benchmark model, ESM-2 (7.5 million parameters), on 8 of the 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and value-added utilization of food processing by-products. The datasets, source code, pretrained models, and tutorials for using PepBERT are available at https://github.com/dzjxzyd/PepBERT.
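To illustrate the workflow the abstract describes, the sketch below shows how a peptide language model is typically used as an encoder: residues are tokenized, mapped to per-residue embeddings, and mean-pooled into a fixed-length vector for downstream prediction. This is a hypothetical, self-contained toy (the vocabulary, embedding table, and dimensions are placeholders, not PepBERT's actual API; see the GitHub repository for the real interface).

```python
import random

# Standard 20 amino acids; ids 0 and 1 are reserved for PAD and UNK
# (a common convention, assumed here rather than taken from PepBERT).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(peptide: str, max_len: int = 50) -> list[int]:
    """Map each residue to an integer token id, truncating to max_len."""
    return [VOCAB.get(aa, 1) for aa in peptide.upper()][:max_len]

# Placeholder embedding table standing in for the pretrained encoder.
random.seed(0)
EMB_DIM = 8  # toy dimension; real pLMs use hundreds of dimensions
EMBEDDINGS = {tid: [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
              for tid in range(len(AMINO_ACIDS) + 2)}

def embed(peptide: str) -> list[float]:
    """Mean-pool per-residue embeddings into one fixed-length vector."""
    vecs = [EMBEDDINGS[tid] for tid in tokenize(peptide)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Peptides of different lengths yield same-size representations,
# which is what makes them usable as features for downstream models.
short_vec = embed("GPRP")
long_vec = embed("ACDEFGHIKLMNPQRSTVWY")
print(len(short_vec), len(long_vec))
```

The key property shown is that pooling turns variable-length peptides into fixed-size vectors, which downstream classifiers (e.g., for bioactivity prediction) consume as features.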