Chatzimiltis Sotiris, Agathocleous Michalis, Promponas Vasilis J, Christodoulou Chris
University of Cyprus, Department of Computer Science, Nicosia, Cyprus.
5G/6GIC, Institute for Communication Systems (ICS), University of Surrey, Guildford, United Kingdom.
Comput Struct Biotechnol J. 2025 Jan 2;27:243-251. doi: 10.1016/j.csbj.2024.12.022. eCollection 2025.
Protein Secondary Structure Prediction (PSSP) is regarded as a challenging task in bioinformatics, and numerous approaches have been proposed to improve prediction accuracy. Accurate PSSP can be instrumental in inferring protein tertiary structure and function. Machine Learning, and Deep Learning in particular, shows promising results on the PSSP problem. In this paper, we deploy a Convolutional Neural Network (CNN) trained with the Subsampled Hessian Newton (SHN) method (a Hessian Free Optimisation variant), using a two-dimensional input representation of embeddings extracted from a language model pretrained on protein sequences. With this CNN and the input embeddings, and without any post-processing, we achieved an average per-residue (Q3) accuracy of 79.96% on the CB513 dataset and 81.45% on the PISCES dataset. Applying ensembles and filtering techniques to the CNN outputs improved overall prediction performance: the Q3 accuracy increased to 93.65% on CB513 and to 87.13% on PISCES. We also evaluated our method on the CASP13 dataset, where prediction performance increased with the post-processing window size. With the largest post-processing window size (limited by the smallest CASP13 protein), we achieved a Q3 accuracy of 98.12% and a Segment Overlap (SOV) score of 96.98 on the CASP13 dataset when the CNNs were trained with the PISCES dataset. Finally, we showed that input representations derived from embeddings can perform as well as representations extracted from multiple sequence alignments.
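To make the Q3 metric and the idea of window-based post-processing concrete, the following is a minimal Python sketch. The Q3 function follows the standard definition (percentage of residues whose three-state label, H, E or C, is predicted correctly). The `majority_filter` function is only an assumed stand-in for a windowed filter: the abstract does not specify the filtering technique used by the authors, so the function name, the majority-vote rule, and the tie-breaking behaviour are all hypothetical.

```python
from collections import Counter


def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def majority_filter(predicted: str, window: int = 3) -> str:
    """Hypothetical post-processing step (an assumption, not the paper's filter).

    Each state is replaced by the most common state in a centred window
    taken from the unfiltered prediction. The window is truncated at the
    sequence ends, and the original state is kept when it ties for the
    highest count.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i, state in enumerate(predicted):
        neighbourhood = predicted[max(0, i - half): i + half + 1]
        counts = Counter(neighbourhood)
        top = max(counts.values())
        if counts[state] == top:
            smoothed.append(state)  # keep the original state on ties
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return "".join(smoothed)


if __name__ == "__main__":
    observed = "HHHHHCCC"
    raw = "HHEHHCCC"  # toy prediction with one isolated error
    filtered = majority_filter(raw, window=3)
    print(q3_accuracy(raw, observed))
    print(filtered, q3_accuracy(filtered, observed))
```

In this toy case the isolated `E` inside a helix is overruled by its neighbours, which illustrates why a filter of this kind can raise Q3; it says nothing about the magnitude of the gains reported above.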