Superior protein thermophilicity prediction with protein language model embeddings.

Authors

Haselbeck Florian, John Maura, Zhang Yuqi, Pirnay Jonathan, Fuenzalida-Werner Juan Pablo, Costa Rubén D, Grimm Dominik G

Affiliations

Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany.

Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, 94315 Straubing, Germany.

Publication information

NAR Genom Bioinform. 2023 Oct 11;5(4):lqad087. doi: 10.1093/nargab/lqad087. eCollection 2023 Dec.

DOI:10.1093/nargab/lqad087
PMID:37829176
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10566323/
Abstract

Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. Such predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
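
The abstract reports its headline comparisons in terms of the Matthews correlation coefficient (MCC). As a reading aid, the following minimal Python sketch shows how MCC is computed from the four counts of a binary confusion matrix (thermophilic as the positive class). The function name and the example counts are hypothetical illustrations, not taken from the paper or from ProLaTherm's code.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention for the
    degenerate case where a whole row or column of the matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not results from the study):
# 90 thermophilic proteins found, 10 missed, 5 false alarms, 95 true negatives.
example_mcc = matthews_corrcoef(tp=90, tn=95, fp=5, fn=10)
print(f"MCC = {example_mcc:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). Because it uses all four cells of the confusion matrix, it stays informative when the two classes are unevenly represented.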

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3aac/10566323/077439cad946/lqad087fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3aac/10566323/b94163dcb2a2/lqad087fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3aac/10566323/7af4bfd5185d/lqad087fig2.jpg
