ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure.

Affiliation

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.

Publication information

Genes (Basel). 2024 Oct 21;15(10):1350. doi: 10.3390/genes15101350.

Abstract

BACKGROUND

Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research.

OBJECTIVES

We aim to automatically extract additional features, in the form of evolutionary information, from a large number of sequences, while also incorporating positional information to obtain more comprehensive sequence features. In addition, we account for the interdependence between secondary structures during the prediction stage.

METHODS

To this end, we propose ILMCNet, a deep neural network model that combines a protein language model (PLM) with a conditional random field (CRF). PLMs pre-trained on sequences from multiple large databases provide sequence features that incorporate evolutionary information, and ILMCNet uses positional encoding to ensure that the input features also include positional information. To make better use of these features, we propose a hybrid network architecture in which a Transformer encoder enhances the features and a feature extraction module combines a convolutional neural network (CNN) with a bidirectional long short-term memory network (BiLSTM). This design extracts local features in depth while capturing global bidirectional information. In the prediction stage, ILMCNet uses the CRF to capture the interdependencies between secondary structures.
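
To illustrate why a CRF output layer can help at the prediction stage, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF in plain Python. It is not the authors' implementation. The three-label alphabet (H, E, C), the function name `viterbi_decode`, and all scores are assumptions chosen for illustration; in a model like the one described, the emission scores would come from the upstream network and the transition scores would be learned.

```python
from typing import List, Sequence

# Assumed 3-state label set for illustration: helix, strand, coil.
LABELS = ["H", "E", "C"]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label path of a linear-chain CRF.

    emissions[t][k]   -- score of label k at residue t
    transitions[j][k] -- score of moving from label j to label k
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best path score ending in each label
    backpointers: List[List[int]] = []  # best previous label per step and label

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            # Pick the previous label that maximizes the path score into k.
            best_prev = max(range(n_labels),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final label.
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy per-residue scores (made up): residue 3 weakly prefers E on its own.
emissions = [
    [2.0, 0.0, 1.0],
    [1.5, 0.0, 1.0],
    [0.9, 1.0, 0.0],
    [1.5, 0.0, 1.0],
    [0.0, 0.0, 2.0],
]
# Toy transition scores (made up): staying in a state is rewarded,
# jumping directly between helix and strand is penalized.
transitions = [
    [1.0, -2.0, 0.0],
    [-2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

greedy = "".join(LABELS[max(range(3), key=lambda k: row[k])] for row in emissions)
decoded = "".join(LABELS[k] for k in viterbi_decode(emissions, transitions))
print(greedy)   # HHEHC (independent per-residue argmax)
print(decoded)  # HHHHC (jointly decoded path)
```

In this toy case, independent per-residue prediction yields an isolated strand residue inside a helix, while joint decoding with transition scores removes it. This is the kind of interdependence between adjacent labels that a CRF layer is meant to capture.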

RESULTS

Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches.

CONCLUSIONS

This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.
