School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China.
Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China.
PLoS Comput Biol. 2022 Oct 31;18(10):e1010668. doi: 10.1371/journal.pcbi.1010668. eCollection 2022 Oct.
Intrinsically disordered proteins and regions (IDPs/IDRs) are widespread in living organisms and perform various essential molecular functions. These functions fall into six general categories: entropic chain, assembler, scavenger, effector, display site, and chaperone. Alterations of IDP function are responsible for many human diseases, so identifying the functions of disordered proteins supports drug target discovery and rational drug design. Experimental identification of the molecular functions of IDPs in the wet lab is expensive and laborious and does not scale. Existing computational methods mainly focus on predicting the entropic chain function of IDRs, while predictors for the remaining five categories of disordered molecular functions are still lacking. Motivated by the growing number of experimentally annotated functional sequences and the need to expand the coverage of disordered protein function predictors, we propose DMFpred for disordered molecular function prediction, covering disordered assembler, scavenger, effector, display site, and chaperone. DMFpred employs the Protein Cubic Language Model (PCLM), which incorporates three protein language models that characterize the sequence, structural, and functional features of proteins, together with attention-based alignment that models the relationships among the three captured features and generates a joint representation of proteins. The PCLM was pre-trained on large-scale IDR sequences and fine-tuned on functionally annotated sequences for molecular function prediction. Performance evaluation on five categories of functional residues and on multi-functional residues suggested that DMFpred provides high-quality predictions. The DMFpred web server is freely accessible at http://bliulab.net/DMFpred/.
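The abstract does not give the PCLM architecture in detail, so the following is only a minimal sketch of what attention-based fusion of three per-residue feature views could look like. It is not the authors' implementation: the function name `fuse_views`, the use of plain scaled dot-product attention without learned projections, and the final averaging step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(seq_feat, struct_feat, func_feat):
    """Fuse three per-residue feature matrices, each of shape (L, d),
    into one joint representation of shape (L, d).

    Hypothetical helper: for every residue, the three views attend to
    each other via scaled dot-product attention, and the attended
    views are averaged. No learned weights are involved.
    """
    views = np.stack([seq_feat, struct_feat, func_feat], axis=1)  # (L, 3, d)
    d = views.shape[-1]
    # Pairwise similarity between the three views of each residue.
    scores = views @ views.transpose(0, 2, 1) / np.sqrt(d)  # (L, 3, 3)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    attended = weights @ views                              # (L, 3, d)
    return attended.mean(axis=1)                            # (L, d)
```

In a trained model, the three inputs would come from separate language models and the attention would use learned query, key, and value projections; here random or precomputed arrays of matching shape can be passed in to see how the shapes flow through.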