Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment.

Affiliations

Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.

Institute for Glycomics, Griffith University, Parklands Dr. Southport, Goldcoast, QLD, 4222, Australia.

Publication information

Sci Rep. 2022 May 9;12(1):7607. doi: 10.1038/s41598-022-11684-w.

Abstract

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that a method called SPOT-1D-LM, which combines traditional one-hot encoding with embeddings from two different language models (ProtTrans and ESM-1b) as its input, yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers, for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, its performance is comparable to that of profile-based methods for proteins with homologous sequences. For example, the accuracies of three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020), the SS3 accuracy of SPOT-1D-LM is 80.41%, which is 3.8% and 8.3% higher than that of SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its speed. Moreover, high-accuracy prediction of both secondary and tertiary structural properties, such as backbone angles and solvent accessibility, without sequence alignment suggests that highly accurate prediction of protein structures may be possible without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
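To make the input construction described in the abstract more concrete, the following is a minimal sketch of how a per-residue one-hot encoding might be concatenated with embeddings from two language models, together with a simple SS3 accuracy calculation. It is not the authors' implementation. The embedding widths (1024 for ProtTrans, 1280 for ESM-1b), the function names, and the zero-filled placeholder embeddings are all assumptions made for illustration; the abstract does not specify them, and real embeddings would come from the respective pretrained models.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Assumed per-residue embedding widths; the abstract does not state them.
PROTTRANS_DIM = 1024
ESM1B_DIM = 1280


def one_hot(sequence: str) -> List[List[float]]:
    """Return one 20-dimensional one-hot row per residue.

    Residues outside the 20 standard letters get an all-zero row.
    """
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows


def build_input(
    sequence: str,
    prottrans_emb: Sequence[Sequence[float]],
    esm1b_emb: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate one-hot and both embeddings residue by residue."""
    if not (len(sequence) == len(prottrans_emb) == len(esm1b_emb)):
        raise ValueError("all inputs must have one row per residue")
    return [
        oh + list(p) + list(e)
        for oh, p, e in zip(one_hot(sequence), prottrans_emb, esm1b_emb)
    ]


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H/E/C) matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and equal in length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical sequence and zero-filled stand-ins for real embeddings.
seq = "MKTAYIAK"
fake_prottrans = [[0.0] * PROTTRANS_DIM for _ in seq]
fake_esm1b = [[0.0] * ESM1B_DIM for _ in seq]

features = build_input(seq, fake_prottrans, fake_esm1b)
print(len(features), len(features[0]))  # 8 2324

accuracy = q3_accuracy("HHHECCCC", "HHHHCCCE")
print(accuracy)  # 0.75 (6 of 8 residues match)
```

Under these assumed widths, each residue is represented by 20 + 1024 + 1280 = 2324 values, and no alignment or profile is needed to build the input. The SS3 percentages quoted in the abstract are accuracies of the kind `q3_accuracy` computes, evaluated over whole test sets.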

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d08/9085874/263aa165a560/41598_2022_11684_Fig1_HTML.jpg
