Protein embeddings predict binding residues in disordered regions.

Affiliations

School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany.

Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany.

Publication information

Sci Rep. 2024 Jun 12;14(1):13566. doi: 10.1038/s41598-024-64211-4.

Abstract

Identifying the binding residues of proteins helps to understand the biological processes they take part in, as protein function is often defined through ligand binding, such as to other proteins, small molecules, ions, or nucleotides. Methods predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs), often also referred to as molecular recognition features (MoRFs). Here, we presented a novel machine learning (ML) model trained specifically to predict binding regions in IDPRs. The proposed model, IDBindT5, leveraged embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval). Assessed on the same data set, this did not differ at the 95% CI from the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind, which rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher Matthews correlation coefficients (MCCs). IDBindT5 produces its SOTA predictions much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder.
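The two metrics named in the abstract, balanced accuracy and MCC, can be illustrated with a minimal sketch for per-residue binary labels (1 = binding, 0 = non-binding). The function names and the toy labels below are hypothetical and chosen only for illustration; this is not the evaluation code from the IDBindT5 repository, and it does not reproduce the confidence-interval estimate reported above.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary per-residue labels (1 = binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; 0.5 corresponds to chance level."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0.0 if the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy example: 8 residues, 3 of them truly binding.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
print(f"MCC: {mcc(y_true, y_pred):.3f}")
```

Balanced accuracy averages the per-class recall, so a predictor that labels every residue as non-binding scores 0.5 regardless of how rare binding residues are. This is why it is a common choice when binding residues make up a small fraction of the data.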

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b21b/11169622/9f85f7492685/41598_2024_64211_Fig1_HTML.jpg
