Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.
Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan.
Protein Sci. 2023 Sep;32(9):e4739. doi: 10.1002/pro.4739.
Conserved residues in alignments of homologous protein sequences are typically structurally or functionally important. For intrinsically disordered proteins and proteins with intrinsically disordered regions (IDRs), however, alignment often fails because these regions lack a fixed three-dimensional structure to constrain their evolution. Although IDR sequences vary, their physicochemical features may be preserved to maintain function. A method that retrieves features shared among IDRs may therefore help identify functionally important residues. We applied unsupervised contrastive learning to train a self-attention neural network model on orthologs of human IDRs. The model parameters were trained to match the sequences within an ortholog pair but not sequences from other IDRs. The trained model successfully identifies critical residues previously reported in experimental studies, especially those that form an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder (mybinder.org). The only required input is the primary sequence. The training scripts are available on GitHub (https://github.com/allmwh/IFF). The training datasets have been deposited in an Open Science Framework repository (https://osf.io/jk29b).
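The training objective described above, matching sequences within ortholog pairs but not across other IDRs, can be illustrated with a generic contrastive loss. The following is a minimal sketch and not the authors' implementation: the InfoNCE-style formulation, the function names `cosine` and `contrastive_loss`, the `temperature` value, and the three-dimensional toy embeddings are all assumptions made for illustration. In practice the embeddings would come from the self-attention network rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed, generic formulation).

    anchors[i] and positives[i] are embeddings of one ortholog pair.
    Every other row of `positives` serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the true ortholog.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical toy embeddings: each "human" IDR lies close to its own ortholog.
human = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ortholog = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

matched = contrastive_loss(human, ortholog)
# Rotate the ortholog list so every anchor is paired with the wrong IDR.
mismatched = contrastive_loss(human, ortholog[1:] + ortholog[:1])
print(matched < mismatched)
```

Minimizing a loss of this kind pulls the embeddings of an ortholog pair together and pushes embeddings of unrelated IDRs apart, which is why correctly paired inputs score lower than shuffled ones in the toy example.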