
CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach.

Affiliations

Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States.

Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States.

Publication Information

ACS Biomater Sci Eng. 2022 Oct 10;8(10):4301-4310. doi: 10.1021/acsbiomaterials.2c00737. Epub 2022 Sep 23.

Abstract

Collagen is one of the most important structural proteins in biology, and its structural hierarchy plays a crucial role in many mechanically important biomaterials. Here, we demonstrate how transformer models can be used to predict, directly from the primary amino acid sequence, the thermal stability of collagen triple helices, measured via the melting temperature Tm. We report two distinct transformer architectures to compare performance. First, we train a small transformer model from scratch, using our collagen data set featuring only 633 sequence-to-Tm pairings. Second, we use a large pretrained transformer model, ProtBERT, and fine-tune it for this particular downstream task by utilizing the sequence-to-Tm pairings, using a deep convolutional network to translate the natural language processing BERT embeddings into the required features. Both the small transformer model and the fine-tuned ProtBERT model achieve similar R² values on test data (R² = 0.84 vs 0.79, respectively), but ProtBERT is a much larger pretrained model that may not always be applicable to other biological or biomaterials questions. Specifically, we show that the small transformer model requires only 0.026% of the number of parameters of the much larger model but reaches almost the same accuracy on the test set. We compare the performance of both models against 71 newly published sequences for which Tm has been obtained, used as a validation set, and find reasonable agreement, with ProtBERT outperforming the small transformer model. The results presented here are, to the best of our knowledge, the first demonstration of the use of transformer models for relatively small data sets and for the prediction of specific biophysical properties of interest. We anticipate that the work presented here serves as a starting point for transformer models to be applied to other biophysical problems.

