Lv Zhibin, Wei Mingxuan, Pei Hongdi, Peng Shiyu, Li Mingxin, Jiang Liangzhen
College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China.
Comput Biol Med. 2025 Feb;185:109598. doi: 10.1016/j.compbiomed.2024.109598. Epub 2024 Dec 20.
Thermophilic, mesophilic, and psychrophilic proteins have wide industrial applications, as enzymes with different optimal temperatures are often needed for different purposes. Convenient methods are needed to determine the optimal temperatures of proteins; however, laboratory methods for this purpose are time-consuming and laborious, and existing machine learning methods can only perform binary classification of thermophilic versus non-thermophilic proteins, or psychrophilic versus non-psychrophilic proteins. Here, we developed a sequence-based deep learning model, PSTP-BERT, that can directly perform three-class identification of thermophilic, mesophilic, and psychrophilic proteins. By comparing BERT-bfd with other deep learning models using five-fold cross-validation, we found that BERT-bfd-extracted features achieved the highest accuracy under six classifiers. Furthermore, to improve the model's accuracy, we used SMOTE (synthetic minority oversampling technique) to balance the dataset and a light gradient-boosting machine to rank the BERT-bfd-extracted features according to their weights. The best-performing model achieved a five-fold cross-validation accuracy of 89.59% and an independent test accuracy of 85.42%. PSTP-BERT performs significantly better than existing models on the three-class identification task. To compare with previous binary classification models, we used PSTP-BERT to perform binary classification of thermophilic versus non-thermophilic proteins, and of psychrophilic versus non-psychrophilic proteins, on an independent test set. PSTP-BERT achieved the highest accuracy on both binary classification tasks: 93.33% for thermophilic protein classification and 88.33% for psychrophilic protein classification. After training and optimization on training sets with different sequence similarities, the model's independent test accuracy reaches between 89.8% and 92.9%, and its prediction accuracy on new data can exceed 97%. For the convenience of future researchers, we have uploaded the source code of PSTP-BERT to GitHub.
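The abstract names SMOTE as the step used to balance the three classes before training. As a rough illustration of the idea only, and not the authors' implementation, the sketch below oversamples each minority class by interpolating between a real sample and one of its nearest same-class neighbours. The function name `smote_balance`, the neighbour count `k`, the class labels, and the toy 2-D vectors are all assumptions made for this example; real BERT-bfd features are far higher-dimensional, and in practice a maintained library implementation would normally be used.

```python
import math
import random
from collections import Counter


def smote_balance(X, y, k=3, seed=0):
    """Oversample every minority class up to the majority-class count.

    Each synthetic point lies on the segment between a real minority
    sample and one of its k nearest same-class neighbours.
    (Hypothetical helper for illustration; not from the paper.)
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)

    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # no neighbour to interpolate toward
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of `base`, excluding itself
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic = [b + gap * (o - b) for b, o in zip(base, other)]
            X_out.append(synthetic)
            y_out.append(label)
    return X_out, y_out


# Toy 2-D stand-ins for sequence embeddings (assumed data).
X = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],  # mesophilic (majority)
    [1.0, 1.0], [1.1, 0.9],                          # thermophilic
    [-1.0, -1.0], [-0.9, -1.2],                      # psychrophilic
]
y = ["meso"] * 4 + ["thermo"] * 2 + ["psychro"] * 2

X_bal, y_bal = smote_balance(X, y)
print(Counter(y_bal))  # every class now has four samples
```

Oversampling of this kind is usually applied to the training folds only, so that synthetic points never leak into the data used to measure accuracy; whether the authors did so is not stated in the abstract.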