DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts.

Affiliations

The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel.

Publication information

Bioinformatics. 2022 Sep 16;38(Suppl_2):ii95-ii98. doi: 10.1093/bioinformatics/btac474.

DOI: 10.1093/bioinformatics/btac474
PMID: 36124789
Abstract

SUMMARY

Recently, deep learning models, initially developed in the field of natural language processing (NLP), were applied successfully to analyze protein sequences. A major drawback of these models is their size in terms of the number of parameters needed to be fitted and the amount of computational resources they require. Recently, 'distilled' models using the concept of student and teacher networks have been widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis, by developing DistilProtBert, a distilled version of the successful ProtBert model. Implementing this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98% relative to the ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model. We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. The task is highly challenging if the composition is maintained on the level of singlet, doublet and triplet amino acids. Indeed, traditional machine-learning algorithms have difficulties with this task. Here, we show that DistilProtBert performs very well on singlet-, doublet- and even triplet-shuffled versions of the human proteome, with AUC of 0.92, 0.91 and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e. shuffled sequences classified as proteins by DistilProtBert), we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences.

AVAILABILITY AND IMPLEMENTATION

https://github.com/yarongef/DistilProtBert.
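The shuffled negatives described in the abstract preserve k-mer composition; in the singlet case this is simply a random permutation of the residues. A minimal sketch of generating such a singlet-shuffled decoy (function and variable names are illustrative, not taken from the DistilProtBert repository):

```python
import random

def singlet_shuffle(seq: str, seed: int = 0) -> str:
    """Return a permutation of `seq` that preserves single-residue
    (singlet) amino acid composition while destroying local order."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

# Toy fragment for illustration, not a real benchmark sequence.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
decoy = singlet_shuffle(protein)
assert sorted(decoy) == sorted(protein)  # composition preserved
```

Doublet- and triplet-preserving shuffles are harder, since they must keep the 2-mer or 3-mer counts intact; k-mer-aware algorithms such as Euler-path shuffling are typically used for that case.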


Similar articles

1
DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii95-ii98. doi: 10.1093/bioinformatics/btac474.
2
Effect of tokenization on transformers for biological sequences.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
3
Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model.
Genes (Basel). 2019 Nov 12;10(11):924. doi: 10.3390/genes10110924.
4
Recurrent Deep Network Models for Clinical NLP Tasks: Use Case with Sentence Boundary Disambiguation.
Stud Health Technol Inform. 2019 Aug 21;264:198-202. doi: 10.3233/SHTI190211.
5
Enhanced identification of membrane transport proteins: a hybrid approach combining ProtBERT-BFD and convolutional neural networks.
J Integr Bioinform. 2023 Jul 28;20(2). doi: 10.1515/jib-2022-0055. eCollection 2023 Jun 1.
6
Masked Language Modeling for Resource Constrained Biological Natural Language Processing.
Annu Int Conf IEEE Eng Med Biol Soc. 2023 Jul;2023:1-5. doi: 10.1109/EMBC40787.2023.10340499.
7
CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach.
ACS Biomater Sci Eng. 2022 Oct 10;8(10):4301-4310. doi: 10.1021/acsbiomaterials.2c00737. Epub 2022 Sep 23.
8
Survey of Protein Sequence Embedding Models.
Int J Mol Sci. 2023 Feb 14;24(4):3775. doi: 10.3390/ijms24043775.
9
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
10
Generating interacting protein sequences using domain-to-domain translation.
Bioinformatics. 2023 Jul 1;39(7). doi: 10.1093/bioinformatics/btad401.

Cited by

1
NeuroScale: evolutional scale-based protein language models enable prediction of neuropeptides.
BMC Biol. 2025 May 28;23(1):142. doi: 10.1186/s12915-025-02243-6.
2
A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks.
J R Soc Interface. 2025 Apr;22(225):20240598. doi: 10.1098/rsif.2024.0598. Epub 2025 Apr 30.
3
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions.
ArXiv. 2025 Apr 1:arXiv:2408.16245v3.
4
bindNode24: Competitive binding residue prediction with 60% smaller model.
Comput Struct Biotechnol J. 2025 Mar 11;27:1060-1066. doi: 10.1016/j.csbj.2025.02.042. eCollection 2025.
5
SELFprot: Effective and Efficient Multitask Finetuning Methods for Protein Parameter Prediction.
J Chem Inf Model. 2025 Apr 14;65(7):3226-3238. doi: 10.1021/acs.jcim.4c02230. Epub 2025 Mar 17.
6
EuDockScore: Euclidean graph neural networks for scoring protein-protein interfaces.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae636.
7
Accurate and efficient protein embedding using multi-teacher distillation learning.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae567.
8
PowerNovo: de novo peptide sequencing via tandem mass spectrometry using an ensemble of transformer and BERT models.
Sci Rep. 2024 Jul 1;14(1):15000. doi: 10.1038/s41598-024-65861-0.
9
Application of Transformers in Cheminformatics.
J Chem Inf Model. 2024 Jun 10;64(11):4392-4409. doi: 10.1021/acs.jcim.3c02070. Epub 2024 May 30.
10
DLM-DTI: a dual language model for the prediction of drug-target interaction with hint-based learning.
J Cheminform. 2024 Feb 1;16(1):14. doi: 10.1186/s13321-024-00808-1.