Graduate School of Environmental Science, Hokkaido University, Sapporo 060-0810, Japan.
Faculty of Environmental Earth Science, Hokkaido University, Sapporo 060-0810, Japan.
J Chem Inf Model. 2024 Apr 8;64(7):2901-2911. doi: 10.1021/acs.jcim.3c01202. Epub 2023 Oct 26.
Intrinsically disordered proteins (IDPs) play a vital role in various biological processes and have attracted increasing attention over the past few decades. Predicting IDPs from the primary structures of proteins offers a rapid and simple means of protein analysis that does not require crystal structures, and machine learning methods in particular have demonstrated their potential in this field. Protein language models (PLMs) have recently emerged as a promising approach to extracting essential information from protein sequences and have been used in protein modeling for their precision and efficiency. In this article, we present IDP-ELM, a novel IDP prediction method that predicts intrinsically disordered regions (IDRs) as well as their functions, including disordered flexible linkers and disordered protein binding. The method uses high-dimensional representations extracted from several state-of-the-art PLMs and predicts IDRs by ensemble learning based on bidirectional recurrent neural networks. Its performance was evaluated on two independent test data sets, from CAID (Critical Assessment of protein Intrinsic Disorder prediction) and CAID2, and showed notable improvements in terms of the area under the receiver operating characteristic curve (AUC), the Matthews correlation coefficient (MCC), and the F1 score. Moreover, IDP-ELM requires only protein sequences as input and does not involve the time-consuming generation of protein profiles, which is a prerequisite for most existing state-of-the-art methods. This makes it an accurate, fast, and convenient tool for proteome-level analysis. The corresponding reproducible source code and model weights are available at https://github.com/xu-shi-jie/idp-elm.
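As an illustration of the evaluation terms used in the abstract, the following is a minimal sketch of how per-residue disorder probabilities from several models might be combined and scored with MCC and F1. It is not the IDP-ELM implementation: the function names, the toy probabilities, the simple averaging (soft voting), and the 0.5 decision threshold are all assumptions made for this example, and the published method may combine its bidirectional recurrent networks differently.

```python
import math

def ensemble_mean(prob_lists):
    """Average per-residue probabilities across models (simple soft voting).

    Hypothetical helper: one way to combine ensemble members, assumed here
    for illustration only.
    """
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 means disordered."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1_score(y_true, y_pred):
    """F1 score, the harmonic mean of precision and recall."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data for a six-residue sequence (made up for this example).
y_true = [1, 1, 0, 0, 1, 0]
model_probs = [
    [0.9, 0.6, 0.2, 0.4, 0.3, 0.1],
    [0.8, 0.7, 0.4, 0.9, 0.5, 0.2],
    [0.7, 0.5, 0.3, 0.4, 0.4, 0.3],
]

mean_probs = ensemble_mean(model_probs)
threshold = 0.5  # assumed decision threshold
y_pred = [1 if p >= threshold else 0 for p in mean_probs]

print("predictions:", y_pred)
print("MCC:", round(mcc(y_true, y_pred), 3))
print("F1:", round(f1_score(y_true, y_pred), 3))
```

On this toy input the ensemble produces one false positive and one false negative, giving an MCC of about 0.333 and an F1 of about 0.667. AUC is omitted because it is computed from the ranking of the raw probabilities, not from thresholded labels.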