Zhao Lufei, Li Jingyi, Zhan Weiqiang, Jiang Xuchu, Zhang Biao
Agricultural Science and Engineering School, Liaocheng University, Liaocheng, 252059, China.
School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China.
Sci Rep. 2024 Jul 17;14(1):16488. doi: 10.1038/s41598-024-67403-0.
Secondary structure prediction is a key step in understanding protein function and biological properties and is highly important in fields such as new drug development, disease treatment, and bioengineering. Accurately predicting the secondary structure of proteins helps to reveal how proteins fold and how they function in cells. Deep learning models are particularly well suited to protein structure prediction because they can process complex sequence information and extract meaningful patterns and features, thereby significantly improving the accuracy and efficiency of prediction. In this study, a combined model integrating an improved temporal convolutional network (TCN), bidirectional long short-term memory (BiLSTM), and a multi-head attention (MHA) mechanism is proposed to enhance the accuracy of protein secondary structure prediction for both eight-state and three-state structures. One-hot encoding features and word vector representations of physicochemical properties are incorporated as inputs. Particular emphasis is placed on knowledge distillation from the ProtT5 pretrained model, which leads to performance improvements. The improved TCN, obtained through multiscale fusion and bidirectional operations, extracts amino acid sequence features better than traditional TCN models. The model demonstrated excellent prediction performance on multiple datasets. Among the six datasets used in this paper, on the TS115, CB513 and PDB (2018-2020) datasets the eight-state prediction accuracy reached 88.2%, 84.9%, and 95.3%, respectively, and the three-state prediction accuracy reached 91.3%, 90.3%, and 96.8%, respectively.
This study not only improves the accuracy of protein secondary structure prediction but also provides a valuable tool for understanding protein structure and function, one that is particularly applicable to resource-constrained contexts.
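As an illustration of two ideas mentioned in the abstract, one-hot encoding of residues and the relationship between eight-state and three-state labels, the following is a minimal sketch and not the authors' implementation. The residue ordering in `AMINO_ACIDS`, the function names, and the `Q8_TO_Q3` mapping (one commonly used reduction of DSSP states) are assumptions made for this example; the abstract does not specify them.

```python
# Illustrative sketch only. The alphabet order, the helper names, and the
# Q8 -> Q3 mapping are assumptions, not details taken from the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, arbitrary order

# One commonly used reduction of the eight DSSP states to three states
# (assumed here): helices -> H, strands -> E, everything else -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 vector per residue.

    Residues outside AMINO_ACIDS (for example 'X') become all-zero vectors.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        if residue in index:
            vector[index[residue]] = 1
        encoded.append(vector)
    return encoded


def reduce_q8_to_q3(labels):
    """Map a string of eight-state labels to three-state labels."""
    return "".join(Q8_TO_Q3[state] for state in labels)


def per_residue_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Under these assumptions, a model's eight-state output can be scored directly with `per_residue_accuracy`, or first collapsed with `reduce_q8_to_q3` and scored as a three-state prediction, which is one reason three-state accuracy is typically higher than eight-state accuracy on the same dataset.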