IFF: Identifying key residues in intrinsically disordered regions of proteins using machine learning.

Affiliations

Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.

Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan.

Publication information

Protein Sci. 2023 Sep;32(9):e4739. doi: 10.1002/pro.4739.

DOI: 10.1002/pro.4739

PMID: 37498545

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10443345/

Abstract

Conserved residues in alignments of homologous protein sequences are usually structurally or functionally important. For intrinsically disordered proteins and proteins with intrinsically disordered regions (IDRs), however, alignment often fails because these sequences lack a fixed three-dimensional structure to constrain their evolution. Although the sequences vary, the physicochemical features of IDRs may be preserved to maintain function. A method that retrieves the features shared among IDRs may therefore help identify functionally important residues. We applied unsupervised contrastive learning to train a self-attention neural network on orthologs of human IDRs. The model parameters were trained so that the sequences within an ortholog pair match each other but do not match other IDRs. The trained model successfully identifies critical residues previously reported in experimental studies, especially those that form an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder (mybinder.org). The only required input is the primary sequence. The training scripts are available on GitHub (https://github.com/allmwh/IFF). The training datasets have been deposited in an Open Science Framework repository (https://osf.io/jk29b).
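
The contrastive objective described above (match sequences within an ortholog pair, but not against other IDRs) can be illustrated with a minimal sketch. The abstract does not state the exact loss or architecture, so this example assumes an InfoNCE-style loss over cosine similarities, a common choice for contrastive learning and not necessarily the one used in IFF. The function names `cosine` and `info_nce_loss`, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical; in a real model the embeddings would come from the self-attention network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (assumed objective).

    anchor    : embedding of an IDR
    positive  : embedding of its ortholog (should score high)
    negatives : embeddings of unrelated IDRs (should score low)
    """
    # The positive pair is placed at index 0 of the logits.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability of picking the true ortholog.
    return log_sum - logits[0]


# Toy embeddings (hypothetical values, for illustration only).
idr = [1.0, 0.0]
ortholog = [0.9, 0.1]
other_idrs = [[0.0, 1.0], [-1.0, 0.0]]

well_matched = info_nce_loss(idr, ortholog, other_idrs)
mismatched = info_nce_loss(idr, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(well_matched < mismatched)
```

The loss is small when the anchor is closer to its ortholog than to the unrelated IDRs and grows when an unrelated IDR scores higher, which is the behavior a contrastive training signal of this kind relies on.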
