Liu Sicen, Chen Shutao, Bai Tao, Liu Bin
SMBU-MSU-BIT Joint Laboratory on Bioinformatics and Engineering Biology, Shenzhen MSU-BIT University, Shenzhen, Guangdong 518172, China.
School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf362.
Intrinsically disordered regions (IDRs) play a significant role in diverse biological processes and are widely distributed in proteins. Accurately predicting these regions is therefore essential for analyzing protein structure and function. Amino acid feature extraction serves as a foundational step in the development of computational predictive models. Existing methods typically rely on traditional biological features (e.g. position-specific scoring matrices, PSSMs) or use pre-trained protein language models (PPLMs) to capture sequence semantic information, and often combine the two by straightforward feature concatenation. However, these approaches fail to capture the multi-semantic interactions between traditional biological features and PPLMs-based features.
In this study, we propose a method named FusionEncoder, designed to integrate the traditional biological features and the PPLMs-based features of a protein. FusionEncoder is a fusion network built on a variant of long short-term memory (LSTM). We treat traditional biological features and PPLMs-based features as two types of semantic input within a "multi-semantic" space. Traditional features are fed into the cell state of the LSTM, while PPLMs-based features are fed into its input part. A fusion cell is then used to fuse the two types of features. This strategy leverages the ability of LSTM to encode long sequences, enhancing context-aware semantic learning of amino acid sequences. Finally, a transformer-based encoder layer is employed to predict the IDRs. Evaluation on four independent test datasets indicates that FusionEncoder clearly improves the accuracy of amino acid feature representation and achieves superior performance compared with existing methods.
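The fusion idea described above can be illustrated with a minimal sketch. It is not the published FusionEncoder implementation: the abstract does not give the cell equations, so the class name `FusionLSTMCell`, the extra fusion gate `k`, the projection `W_bio`, and the update `c_t = f * c_{t-1} + i * g + k * tanh(W_bio b_t)` are all assumptions chosen for illustration. The sketch only shows one plausible way for a per-residue PPLM embedding to drive the LSTM gates while a traditional feature vector (e.g. a PSSM row) is injected into the cell state.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vec, bias):
    """Compute weights @ vec + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def rand_matrix(rng, rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FusionLSTMCell:
    """Hypothetical fusion cell (an assumption, not the authors' code).

    The PPLM embedding enters through the usual LSTM input path, while the
    traditional feature vector is projected and added to the cell state
    through an extra gate k.
    """

    def __init__(self, plm_dim, bio_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        gate_in = plm_dim + hidden_dim
        # i, f, o, g are the standard LSTM gates; k is the assumed fusion gate.
        self.W = {n: rand_matrix(rng, hidden_dim, gate_in) for n in "ifogk"}
        self.b = {n: [0.0] * hidden_dim for n in "ifogk"}
        # Projection of traditional features into the cell-state space.
        self.W_bio = rand_matrix(rng, hidden_dim, bio_dim)
        self.b_bio = [0.0] * hidden_dim

    def step(self, plm_vec, bio_vec, h_prev, c_prev):
        z = list(plm_vec) + list(h_prev)  # concatenate [x_t, h_{t-1}]
        pre = {n: affine(self.W[n], z, self.b[n]) for n in "ifogk"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        k = [sigmoid(v) for v in pre["k"]]
        g = [math.tanh(v) for v in pre["g"]]
        # Traditional features bypass the input gates and go to the cell state.
        m = [math.tanh(v) for v in affine(self.W_bio, bio_vec, self.b_bio)]
        c = [f[j] * c_prev[j] + i[j] * g[j] + k[j] * m[j]
             for j in range(self.hidden_dim)]
        h = [o[j] * math.tanh(c[j]) for j in range(self.hidden_dim)]
        return h, c

    def encode(self, plm_seq, bio_seq):
        """Return one fused hidden vector per residue."""
        if len(plm_seq) != len(bio_seq):
            raise ValueError("both feature sequences must cover every residue")
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        fused = []
        for plm_vec, bio_vec in zip(plm_seq, bio_seq):
            h, c = self.step(plm_vec, bio_vec, h, c)
            fused.append(h)
        return fused
```

In a full predictor along the lines described, the per-residue vectors returned by `encode` would be passed to a transformer-based encoder layer and a classification head to score each residue as ordered or disordered. That stage is omitted here.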
To facilitate accessibility for experimental researchers, a user-friendly and publicly available webserver for the FusionEncoder predictor has been deployed at http://bliulab.net/FusionEncoder/. FusionEncoder is expected to serve as a valuable tool for the accurate identification of IDRs.