Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry.

Affiliations

Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.

The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA.

Publication information

Sci Rep. 2023 Aug 16;13(1):13280. doi: 10.1038/s41598-023-40247-w.

DOI:10.1038/s41598-023-40247-w

PMID:37587128

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10432456/

Abstract

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
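
The idea of a combined model that takes the individual model predictions as input can be illustrated with a minimal sketch. The abstract does not specify the combiner's architecture, so a multinomial logistic regression is swapped in here purely for illustration; it is not the authors' method. The helper names (`stack_predictions`, `fit_combiner`, `synthetic_model`) are hypothetical, and the data are synthetic: two toy "models" that are each reliable on a different, arbitrarily chosen half of the 20 amino-acid classes, standing in for the complementary strengths described above.

```python
import numpy as np

N_AA = 20  # number of amino-acid classes


def stack_predictions(per_model_probs):
    """Concatenate per-site probability vectors, each (n_sites, N_AA), column-wise."""
    return np.concatenate(per_model_probs, axis=1)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_combiner(X, y, lr=0.5, epochs=500):
    """Fit a multinomial logistic regression by full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, N_AA))
    b = np.zeros(N_AA)
    Y = np.eye(N_AA)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        G = (P - Y) / n  # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(W, b, X):
    return np.argmax(X @ W + b, axis=1)


def synthetic_model(y, reliable_classes, rng, peak=0.9):
    """Toy stand-in for one model: correct on `reliable_classes`, random guess elsewhere."""
    random_guess = rng.integers(0, N_AA, size=y.size)
    guess = np.where(np.isin(y, reliable_classes), y, random_guess)
    probs = np.full((y.size, N_AA), (1.0 - peak) / (N_AA - 1))
    probs[np.arange(y.size), guess] = peak
    return probs


def accuracy(pred, y):
    return float(np.mean(pred == y))


rng = np.random.default_rng(0)
y_train = rng.integers(0, N_AA, size=4000)
y_test = rng.integers(0, N_AA, size=2000)
group_a, group_b = np.arange(0, 10), np.arange(10, 20)  # arbitrary class split (assumption)

# "structure-like" model is reliable on group_a, "sequence-like" model on group_b
s_train = synthetic_model(y_train, group_a, rng)
q_train = synthetic_model(y_train, group_b, rng)
s_test = synthetic_model(y_test, group_a, rng)
q_test = synthetic_model(y_test, group_b, rng)

W, b = fit_combiner(stack_predictions([s_train, q_train]), y_train)

acc_structure = accuracy(s_test.argmax(axis=1), y_test)
acc_sequence = accuracy(q_test.argmax(axis=1), y_test)
acc_combined = accuracy(predict(W, b, stack_predictions([s_test, q_test])), y_test)
print(acc_structure, acc_sequence, acc_combined)
```

In this toy setting the combiner can learn which input model to trust for which amino-acid class, so its accuracy exceeds that of either input alone. With four real models, the same stacking step would concatenate four 20-dimensional probability vectors per site.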


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/962d/10432456/0f905bfd0b39/41598_2023_40247_Fig1_HTML.jpg
