AggBERT: Best in Class Prediction of Hexapeptide Amyloidogenesis with a Semi-Supervised ProtBERT Model.

Affiliation

Department of Chemistry, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States.

Publication information

J Chem Inf Model. 2023 Sep 25;63(18):5727-5733. doi: 10.1021/acs.jcim.3c00817. Epub 2023 Aug 8.

DOI: 10.1021/acs.jcim.3c00817

PMID: 37552230

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10777593/

Abstract

The prediction of peptide amyloidogenesis is a challenging problem in the field of protein folding. Large language models, such as the ProtBERT model, have recently emerged as powerful tools for analyzing protein sequences in applications such as predicting protein structure and function. In this article, we describe the use of a semisupervised, fine-tuned ProtBERT model to predict peptide amyloidogenesis from sequence alone. Our approach, which we call AggBERT, achieved state-of-the-art performance, demonstrating the potential of large language models to improve the accuracy and speed of amyloid fibril prediction over simple heuristics or structure-based approaches. This work highlights the transformative potential of machine learning and large language models in the fields of chemical biology and biomedicine.
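To make the sequence-only setup concrete, here is a minimal sketch of how a hexapeptide might be prepared for a ProtBERT classifier and how a generic semi-supervised (pseudo-labeling) step could select confident predictions. It is not the authors' code. The abstract does not say which semi-supervised scheme AggBERT uses, so the pseudo-labeling rule, the 0.9 threshold, and the helper names `format_hexapeptide`, `select_pseudo_labels`, and `load_protbert_classifier` are all assumptions made for illustration. The input convention (uppercase residues separated by spaces, rare residues U, Z, O, B mapped to X) follows the public `Rostlab/prot_bert` model card.

```python
import re

# The 20 standard amino acids plus X (unknown), as single-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def format_hexapeptide(seq: str) -> str:
    """Return a six-residue peptide in the space-separated form ProtBERT expects."""
    seq = seq.strip().upper()
    # The ProtBERT model card maps the rare residues U, Z, O, B to X.
    seq = re.sub(r"[UZOB]", "X", seq)
    if len(seq) != 6 or not set(seq) <= VALID_RESIDUES:
        raise ValueError(f"not a valid hexapeptide: {seq!r}")
    return " ".join(seq)


def select_pseudo_labels(probs, threshold=0.9):
    """Hypothetical pseudo-labeling rule: keep only confident predictions.

    `probs` maps an unlabeled peptide to its predicted probability of being
    amyloidogenic. Peptides scoring >= threshold get label 1, peptides
    scoring <= 1 - threshold get label 0, and the rest are left unlabeled.
    """
    selected = []
    for seq, p in probs.items():
        if p >= threshold:
            selected.append((seq, 1))
        elif p <= 1.0 - threshold:
            selected.append((seq, 0))
    return selected


def load_protbert_classifier():
    """Load ProtBERT with a fresh 2-class head (requires `transformers`).

    The classification head is randomly initialized, so the model would
    still need fine-tuning on labeled hexapeptides before use.
    """
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertForSequenceClassification.from_pretrained("Rostlab/prot_bert", num_labels=2)
    return tokenizer, model


if __name__ == "__main__":
    print(format_hexapeptide("STVIIE"))
```

In a self-training loop of this kind, the peptides returned by `select_pseudo_labels` would be added to the labeled training set before the next round of fine-tuning. The example probabilities a caller passes in are placeholders, not results from the paper.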

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d5c8/10777593/d40e9ba578ee/nihms-1950881-f0002.jpg
