Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States.
Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States.
ACS Biomater Sci Eng. 2022 Oct 10;8(10):4301-4310. doi: 10.1021/acsbiomaterials.2c00737. Epub 2022 Sep 23.
Collagen is one of the most important structural proteins in biology, and its structural hierarchy plays a crucial role in many mechanically important biomaterials. Here, we demonstrate how transformer models can be used to predict the thermal stability of collagen triple helices, measured via the melting temperature (Tm), directly from the primary amino acid sequence. We compare two distinct transformer architectures. First, we train a small transformer model from scratch on our collagen data set of only 633 sequence-to-Tm pairings. Second, we fine-tune a large pretrained transformer model, ProtBERT, on sequence-to-Tm pairings for this downstream task, using a deep convolutional network to translate the natural language processing BERT embeddings into the required features. The small transformer model and the fine-tuned ProtBERT model reach similar R² values on the test data (R² = 0.84 vs 0.79, respectively), but ProtBERT is a much larger pretrained model that may not always be applicable to other biological or biomaterials questions. Specifically, the small transformer model requires only 0.026% of the parameters of the much larger model yet reaches almost the same accuracy on the test set. We also evaluate both models on a validation set of 71 newly published sequences for which Tm has been obtained and find reasonable agreement, with ProtBERT outperforming the small transformer model. To the best of our knowledge, these results are the first demonstration of transformer models applied to relatively small data sets to predict specific biophysical properties of interest. We anticipate that this work will serve as a starting point for applying transformer models to other biophysical problems.
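As an illustration of how the test-set comparison above could be scored, the following is a minimal sketch that assumes the reported fit metric is the coefficient of determination (R²) between measured and predicted melting temperatures. The function name `r_squared` and the sample temperatures are hypothetical and are not taken from the paper's data or code.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_measured = sum(measured) / len(measured)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical melting temperatures in degrees Celsius (illustrative only).
measured_tm = [30.0, 35.0, 40.0, 45.0]
predicted_tm = [31.0, 34.0, 41.0, 44.0]

score = r_squared(measured_tm, predicted_tm)
print(f"R^2 = {score:.3f}")
```

A value of 1 would indicate that predictions match the measurements exactly, while a value near 0 would indicate a fit no better than always predicting the mean measured temperature.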