SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity.

Authors

Magnan Christophe N, Baldi Pierre

Affiliation

Department of Computer Science and Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.

Publication information

Bioinformatics. 2014 Sep 15;30(18):2592-7. doi: 10.1093/bioinformatics/btu352. Epub 2014 May 24.

DOI:10.1093/bioinformatics/btu352

PMID:24860169

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4215083/

Abstract

MOTIVATION

Accurately predicting protein secondary structure and relative solvent accessibility is important for the study of protein evolution, structure and function and as a component of protein 3D structure prediction pipelines. Most predictors use a combination of machine learning and profiles, and thus must be retrained and assessed periodically as the number of available protein sequences and structures continues to grow.

RESULTS

We present newly trained modular versions of the SSpro and ACCpro predictors of secondary structure and relative solvent accessibility together with their multi-class variants SSpro8 and ACCpro20. We introduce a sharp distinction between the use of sequence similarity alone, typically in the form of sequence profiles at the input level, and the additional use of sequence-based structural similarity, which uses similarity to sequences in the Protein Data Bank to infer annotations at the output level, and study their relative contributions to modern predictors. Using sequence similarity alone, SSpro's accuracy is between 79 and 80% (79% for ACCpro) and no other predictor seems to exceed 82%. However, when sequence-based structural similarity is added, the accuracy of SSpro rises to 92.9% (90% for ACCpro). Thus, by combining both approaches, these problems appear now to be essentially solved, as an accuracy of 100% cannot be expected for several well-known reasons. These results point also to several open technical challenges, including (i) achieving on the order of ≥ 80% accuracy, without using any similarity with known proteins and (ii) achieving on the order of ≥ 85% accuracy, using sequence similarity alone.
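The distinction drawn above between input-level sequence similarity and output-level structural similarity can be illustrated with a minimal sketch. It is not the SSpro/ACCpro implementation: the function names `combine_predictions` and `per_residue_accuracy`, the majority-vote rule, and the toy data are all assumptions made for illustration. The sketch assumes that a profile-based predictor has already produced a per-residue label string, and that a separate similarity search against the Protein Data Bank has produced, for each residue, the labels observed at aligned positions in similar proteins (an empty list where nothing aligns).

```python
from collections import Counter
from typing import Sequence


def combine_predictions(ab_initio: str,
                        homolog_labels: Sequence[Sequence[str]]) -> str:
    """Override profile-based labels with labels transferred from homologs.

    ab_initio      -- per-residue labels from a profile-based predictor
                      (hypothetical input, e.g. 'H', 'E', 'C').
    homolog_labels -- for each residue, the labels seen at aligned positions
                      in structurally annotated homologs (hypothetical input).
    The majority vote used here is an assumed rule, chosen for simplicity.
    """
    if len(ab_initio) != len(homolog_labels):
        raise ValueError("inputs must cover the same number of residues")

    combined = []
    for predicted, votes in zip(ab_initio, homolog_labels):
        if votes:
            # Structural similarity available: take the most common label.
            label, _count = Counter(votes).most_common(1)[0]
            combined.append(label)
        else:
            # No aligned homolog: keep the profile-based prediction.
            combined.append(predicted)
    return "".join(combined)


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy data, invented for illustration only.
observed = "HHHHEEEECC"
ab_initio = "HHHCEEECCC"
homolog_labels = [[], [], ["H"], [], [], [], [], ["E", "E", "C"], [], []]

combined = combine_predictions(ab_initio, homolog_labels)
print(per_residue_accuracy(ab_initio, observed))
print(per_residue_accuracy(combined, observed))
```

On this toy input, the second printed accuracy is higher than the first because one residue that the profile-based labels got wrong is corrected by the labels transferred from aligned homologs. The accuracy figures reported in the abstract come from the authors' own evaluation, not from a rule this simple.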

AVAILABILITY AND IMPLEMENTATION

SSpro, SSpro8, ACCpro and ACCpro20 programs, data and web servers are available through the SCRATCH suite of protein structure predictors at http://scratch.proteomics.ics.uci.edu.
