Protein embeddings predict binding residues in disordered regions.

Affiliations

School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany.

Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany.

Publication information

Sci Rep. 2024 Jun 12;14(1):13566. doi: 10.1038/s41598-024-64211-4.

DOI:10.1038/s41598-024-64211-4

PMID:38866950

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11169622/

Abstract

The identification of protein binding residues helps to understand their biological processes as protein function is often defined through ligand binding, such as to other proteins, small molecules, ions, or nucleotides. Methods predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs), often also referred to as molecular recognition features (MoRFs). Here, we presented a novel machine learning (ML) model trained to specifically predict binding regions in IDPRs. The proposed model, IDBindT5, leveraged embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval). Assessed on the same data set, this did not differ at the 95% CI from the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind that rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher MCCs. IDBindT5's SOTA predictions are much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder.
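
The abstract reports performance as balanced accuracy with a 95% confidence interval and compares methods by MCC (Matthews correlation coefficient). The following is a minimal sketch of how such per-residue metrics might be computed. It is not the authors' evaluation code. The toy labels, the function names, and the choice of a residue-level bootstrap with a 1.96 × standard deviation half-width are all illustrative assumptions; the published evaluation may resample differently (for example per protein).

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_ci95(y_true, y_pred, metric, n_resamples=1000, seed=0):
    """Assumed scheme: resample residues with replacement, then report the
    bootstrap mean and a 1.96 * standard-deviation half-width."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, 1.96 * math.sqrt(variance)


# Hypothetical per-residue labels for one short disordered region.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.733
print(round(mcc(y_true, y_pred), 3))                # 0.467
mean, half_width = bootstrap_ci95(y_true, y_pred, balanced_accuracy)
print(f"balanced accuracy: {mean:.3f} ± {half_width:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so a predictor that labels every residue as non-binding scores 50% regardless of how rare binding residues are. This is why a value such as 57.2% should be read against a 50% baseline rather than against 0%.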
