Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment.

Affiliations

Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.

Institute for Glycomics, Griffith University, Parklands Dr. Southport, Goldcoast, QLD, 4222, Australia.

Publication information

Sci Rep. 2022 May 9;12(1):7607. doi: 10.1038/s41598-022-11684-w.

DOI:10.1038/s41598-022-11684-w

PMID:35534620

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9085874/

Abstract

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that SPOT-1D-LM, a method that combines traditional one-hot encoding with embeddings from two different language models (ProtTrans and ESM-1b) as input, yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers, for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, its performance is comparable to that of profile-based methods for proteins with homologous sequences. For example, the accuracies of three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020), SS3 accuracy is 80.41% by SPOT-1D-LM, which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its speed. Moreover, high-accuracy prediction of both secondary and tertiary structural properties, such as backbone angles and solvent accessibility, without sequence alignment suggests that highly accurate prediction of protein structures may be possible without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
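The input construction described in the abstract (one-hot encoding combined with embeddings from two language models) can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding widths (`ESM1B_DIM`, `PROTTRANS_DIM`), the helper names, and the random stand-in embeddings are assumptions made for illustration, and the abstract does not state how the features are combined, so simple per-residue concatenation is assumed here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Assumed per-residue embedding widths; the abstract does not give them.
ESM1B_DIM = 1280
PROTTRANS_DIM = 1024


def one_hot(sequence):
    """Return an L x 20 one-hot matrix; unknown residues map to an all-zero row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1.0
        rows.append(row)
    return rows


def fake_embedding(length, dim, seed):
    """Hypothetical stand-in for a language-model embedding: random L x dim values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


def build_input_features(sequence, esm_embedding, prottrans_embedding):
    """Concatenate the three per-residue feature sources along the feature axis."""
    if not (len(sequence) == len(esm_embedding) == len(prottrans_embedding)):
        raise ValueError("all feature sources must have one row per residue")
    return [
        oh + esm + pt
        for oh, esm, pt in zip(one_hot(sequence), esm_embedding, prottrans_embedding)
    ]


sequence = "MKTAYIAKQR"
features = build_input_features(
    sequence,
    fake_embedding(len(sequence), ESM1B_DIM, seed=0),
    fake_embedding(len(sequence), PROTTRANS_DIM, seed=1),
)
print(len(features), len(features[0]))  # 10 2324
```

In a real pipeline the two `fake_embedding` calls would be replaced by per-residue outputs of the actual language models; the point of the sketch is only that no alignment or profile is needed to build the input, since every feature is derived from the single sequence.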

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d08/9085874/263aa165a560/41598_2022_11684_Fig1_HTML.jpg
