College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.
Genes (Basel). 2024 Oct 21;15(10):1350. doi: 10.3390/genes15101350.
Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research.
We aim to automatically extract additional features represented by evolutionary information from a large number of sequences while simultaneously incorporating positional information for more comprehensive sequence features. Additionally, we consider the interdependence between secondary structures during the prediction stage.
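One common way to add positional information to per-residue feature vectors is the fixed sinusoidal encoding from the original Transformer. The abstract does not say which encoding variant ILMCNet uses, so the following is only a minimal sketch under that assumption. The function names `sinusoidal_positional_encoding` and `add_positions` are hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, dim):
    """Return a seq_len x dim table of fixed sinusoidal position codes.

    Assumption: the standard Transformer formulation, where even columns
    hold sin(pos / 10000^(2k/dim)) and odd columns hold the matching cos.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):  # i = 2k walks over the even column indices
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


def add_positions(features):
    """Add position codes element-wise to a list of per-residue feature vectors."""
    pe = sinusoidal_positional_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(features, pe)]
```

In this sketch, `features` stands in for the per-residue embeddings produced by a protein language model. The encoding is added rather than concatenated, so the feature dimension is unchanged.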
To this end, we propose a deep neural network model, ILMCNet, which combines a protein language model with a Conditional Random Field (CRF). Protein language models (PLMs) pre-trained on sequences from multiple large databases can provide sequence features that incorporate evolutionary information. ILMCNet uses positional encoding to ensure that the input features include positional information. To better utilize these features, we propose a hybrid network architecture that employs a Transformer Encoder to enhance features and integrates a feature extraction module combining a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory Network (BiLSTM). This design enables deep extraction of localized features while capturing global bidirectional information. In the prediction stage, ILMCNet employs a CRF to capture the interdependencies between secondary structures.
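To illustrate how a CRF output layer can capture interdependencies between neighboring secondary-structure labels, the sketch below runs Viterbi decoding over a linear-chain CRF and compares it with an independent per-residue argmax. It is not the authors' implementation. The three-state label set `HEC` (helix, strand, coil), the score values, and the function names are all assumptions made for illustration, and CRF training is omitted.

```python
def greedy_decode(emissions, labels):
    """Pick the best label for each residue independently (no CRF)."""
    return "".join(
        labels[max(range(len(row)), key=row.__getitem__)] for row in emissions
    )


def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence under a linear-chain CRF.

    emissions[t][k]:   score of label k at residue t (e.g. from the network)
    transitions[j][k]: score of label j being followed by label k
    """
    n_labels = len(labels)
    score = list(emissions[0])  # best path score ending in each label
    backptr = []                # backptr[t-1][k]: best previous label for k at t
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
            ptr.append(best_j)
        score = new_score
        backptr.append(ptr)
    # Trace the best path back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(labels[k] for k in path)


# Hypothetical scores: residue 3 slightly prefers strand (E) on its own,
# but the transition scores penalize a one-residue strand inside a helix.
labels = "HEC"
emissions = [
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.9, 1.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
]
transitions = [
    [1.0, -1.0, 0.0],   # H -> H, E, C
    [-1.0, 1.0, 0.0],   # E -> H, E, C
    [0.0, 0.0, 1.0],    # C -> H, E, C
]

print(greedy_decode(emissions, labels))                 # HHEHC
print(viterbi_decode(emissions, transitions, labels))   # HHHHC
```

The independent argmax yields an isolated strand residue inside a helix, while the CRF decoding smooths it away because the learned transition scores make that pattern unlikely.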
Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches.
This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.