Department of Chemistry, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States.
J Chem Inf Model. 2023 Sep 25;63(18):5727-5733. doi: 10.1021/acs.jcim.3c00817. Epub 2023 Aug 8.
The prediction of peptide amyloidogenesis is a challenging problem in the field of protein folding. Large language models, such as ProtBERT, have recently emerged as powerful tools for analyzing protein sequences in applications such as predicting protein structure and function. In this article, we describe the use of a semisupervised, fine-tuned ProtBERT model to predict peptide amyloidogenesis from sequence alone. Our approach, which we call AggBERT, achieved state-of-the-art performance, demonstrating the potential of large language models to improve the accuracy and speed of amyloid fibril prediction over simple heuristics or structure-based approaches. This work highlights the transformative potential of machine learning and large language models in the fields of chemical biology and biomedicine.