School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany.
Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany.
Sci Rep. 2024 Jun 12;14(1):13566. doi: 10.1038/s41598-024-64211-4.
Identifying protein binding residues helps to understand the biological processes that proteins take part in, because protein function is often defined through the binding of ligands such as other proteins, small molecules, ions, or nucleotides. Methods that predict binding residues frequently err for intrinsically disordered proteins or regions (IDPs/IDPRs), often also referred to as molecular recognition features (MoRFs). Here, we present a novel machine learning (ML) model trained specifically to predict binding regions in IDPRs. The proposed model, IDBindT5, leverages embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval, CI). Assessed on the same data set, this performance did not differ at the 95% CI from that of the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind, which rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher Matthews correlation coefficients (MCCs). IDBindT5 produces its SOTA-level predictions much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder.
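The two metrics named in the abstract, balanced accuracy and MCC, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions for per-residue binary labels (1 = binding, 0 = non-binding). The function names and the toy labels are hypothetical and are not taken from the IDBindT5 repository. The abstract does not state how the 95% confidence interval was obtained, so that step is left out.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary per-residue labels (1 = binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity; a class with no samples contributes 0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only (not data from the paper).
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
    counts = confusion_counts(y_true, y_pred)
    print("balanced accuracy:", round(balanced_accuracy(*counts), 3))
    print("MCC:", round(mcc(*counts), 3))
```

Under these definitions, a trivial predictor that labels every residue as non-binding scores a balanced accuracy of 0.5 regardless of class imbalance, which is the baseline against which the reported 57.2% can be read.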