Institute of Biotechnology, Life Sciences Center, Vilnius University, LT-10257 Vilnius, Lithuania.
Institute of Computer Science, Faculty of Mathematics and Informatics, Vilnius University, LT-08303 Vilnius, Lithuania.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae157.
Reliable prediction of a protein's thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled with machine learning, taking advantage of the recent rapid progress in deep learning methods for sequence analysis. These methods make it feasible to train on more data and may enable the development of more versatile thermostability predictors that cover multiple temperature ranges.
We applied the principle of transfer learning to predict protein thermostability using embeddings that protein language models (pLMs) generate from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict the thermostability of CRISPR-Cas Class II effector proteins (C2EPs). The predictions indicated sharp differences in thermostability among groups of C2EPs and were largely consistent with both previously published and our newly obtained experimental data.
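The transfer-learning idea described above can be illustrated with a minimal sketch. It is not the TemStaPro implementation: the pLM is replaced by a hypothetical `fake_plm_embed` function that returns synthetic per-residue vectors, the embedding size, the temperature thresholds (40, 50, 60) and the logistic-regression head are all assumptions chosen for illustration, and the data are randomly generated. The sketch only shows the general pattern of pooling frozen embeddings into one vector per protein and training a small binary classifier per temperature threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed embedding size (real pLM embeddings are much larger)

# Hypothetical direction in embedding space that correlates with temperature.
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)

def fake_plm_embed(temp, length):
    """Stand-in for a frozen pLM: returns a synthetic (length, D) matrix.

    A real pipeline would compute per-residue embeddings from the sequence;
    here the organism growth temperature is encoded directly, plus noise.
    """
    signal = (temp - 50.0) / 20.0 * direction
    return signal + rng.normal(scale=1.0, size=(length, D))

def mean_pool(per_residue):
    """Collapse an (L, D) per-residue embedding into one D-dim vector."""
    return per_residue.mean(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression head on fixed embeddings by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        grad = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic data set: one growth temperature and one sequence length per protein.
n = 400
growth_temp = rng.uniform(10, 90, size=n)
lengths = rng.integers(50, 300, size=n)
X = np.stack([mean_pool(fake_plm_embed(t, int(L)))
              for t, L in zip(growth_temp, lengths)])

train, test = slice(0, 300), slice(300, n)
thresholds = (40, 50, 60)  # assumed thresholds in degrees Celsius

# One binary classifier per threshold: "stable at or above this temperature?"
accuracy = {}
for thr in thresholds:
    y = (growth_temp >= thr).astype(float)
    w, b = train_logreg(X[train], y[train])
    pred = sigmoid(X[test] @ w + b) >= 0.5
    accuracy[thr] = float((pred == (y[test] == 1.0)).mean())

for thr, acc in accuracy.items():
    print(f"threshold {thr} C: held-out accuracy {acc:.2f}")
```

Because the pLM stays frozen in this pattern, only the small classifier heads are trained, which is what makes training on a very large labelled set computationally cheap.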
TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.