Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
Institute of Public Safety Research, Department of Engineering Physics, Tsinghua University, Beijing 100084, China.
Int J Mol Sci. 2024 Nov 5;25(22):11866. doi: 10.3390/ijms252211866.
Thermophilic proteins remain stable and functional at extreme high temperatures, which makes them important for both fundamental biological research and biotechnological applications. In this study, we developed TPGPred, a machine learning model based on GradientBoosting that predicts whether a protein is thermophilic, using a large-scale dataset of thermophilic and non-thermophilic protein sequences. By combining various machine learning algorithms with feature-engineering methods, we systematically evaluated classification performance and identified the optimal feature combinations and classification models. Trained on a large public dataset of 5652 samples, TPGPred achieved an accuracy greater than 0.95 and an area under the receiver operating characteristic curve (AUROC) greater than 0.98 on an independent test set of 627 samples. Our findings offer new insights into the identification and classification of thermophilic proteins and provide a solid foundation for developing their industrial applications.
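To make the evaluation terms in the abstract concrete, the following minimal Python sketch shows one common way to turn a protein sequence into numeric features and how accuracy and AUROC can be computed from classifier scores. It is an illustration only, not the TPGPred implementation: the abstract does not name the features used, so amino acid composition is an assumption, the function names `aac_features`, `accuracy`, and `auroc` are hypothetical, and the labels and scores are made-up toy values rather than results from the study. The gradient boosting classifier itself is not reproduced here.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence):
    """Amino acid composition: fraction of each standard residue in the sequence.

    Assumption: this is one generic sequence descriptor; the abstract does not
    state which features TPGPred actually uses.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auroc(y_true, y_score):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = thermophilic, 0 = non-thermophilic) and made-up scores.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.4, 0.1]

features = aac_features("MKVLAAGE")  # hypothetical sequence fragment
toy_accuracy = accuracy(y_true, y_score)
toy_auroc = auroc(y_true, y_score)
```

Accuracy depends on the chosen decision threshold, while AUROC summarizes how well the scores rank positives above negatives across all thresholds, which is presumably why the abstract reports both.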