Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
J Phys Chem Lett. 2023 Nov 23;14(46):10427-10434. doi: 10.1021/acs.jpclett.3c02398. Epub 2023 Nov 13.
Recent advances in language models have equipped the protein modeling community with a powerful tool that uses transformers to represent protein sequences as text. This breakthrough enables sequence-to-property prediction for peptides without relying on explicit structural data. Inspired by recent progress in large language models, we present PeptideBERT, a protein language model tailored to predicting essential peptide properties such as hemolysis, solubility, and nonfouling. PeptideBERT builds on the pretrained ProtBERT transformer model with 12 attention heads and 12 hidden layers. By fine-tuning the pretrained model on the three downstream tasks, our model achieves state-of-the-art (SOTA) performance in predicting hemolysis, which is crucial for determining a peptide's potential to induce lysis of red blood cells, as well as nonfouling properties. Leveraging primarily shorter sequences and a data set whose negative samples are predominantly associated with insoluble peptides, our model showcases remarkable performance.
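The fine-tuning setup described above can be sketched as a BERT sequence-classification head over tokenized amino-acid sequences. This is a minimal, illustrative sketch, not the authors' code: it uses a tiny randomly initialized `BertConfig` stand-in (the paper's model instead loads the pretrained checkpoint, e.g. `BertForSequenceClassification.from_pretrained("Rostlab/prot_bert", num_labels=2)`, with 12 layers and 12 attention heads), and the peptide sequences, vocabulary, and labels below are hypothetical.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny stand-in configuration for illustration only; ProtBERT itself has
# 12 hidden layers and 12 attention heads and would be loaded pretrained.
config = BertConfig(
    vocab_size=30,          # ~20 amino acids plus special tokens
    hidden_size=64,
    num_hidden_layers=2,    # 12 in ProtBERT
    num_attention_heads=2,  # 12 in ProtBERT
    intermediate_size=128,
    num_labels=2,           # binary property, e.g. hemolytic vs. non-hemolytic
)
model = BertForSequenceClassification(config)

# Map each amino-acid letter to a token id (illustrative vocabulary;
# ids 0-4 are reserved for [PAD], [UNK], [CLS], [SEP], [MASK]).
aa_vocab = {aa: i + 5 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode(seq, max_len=16):
    ids = [2] + [aa_vocab[a] for a in seq] + [3]   # [CLS] ... [SEP]
    ids += [0] * (max_len - len(ids))              # pad to max_len
    return ids

# Hypothetical peptides and labels (1 = positive class).
batch = torch.tensor([encode("GLFDIVKKV"), encode("AAAAKAAA")])
mask = (batch != 0).long()
labels = torch.tensor([1, 0])

out = model(input_ids=batch, attention_mask=mask, labels=labels)
out.loss.backward()  # gradients for one fine-tuning step
print(out.logits.shape)  # one logit pair per peptide
```

In the paper's actual pipeline the pretrained weights are fine-tuned separately for each of the three downstream tasks (hemolysis, solubility, nonfouling), each framed as binary sequence classification.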