SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins.

Author information

Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand.

Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.

Publication information

Comput Biol Med. 2022 Jul;146:105704. doi: 10.1016/j.compbiomed.2022.105704. Epub 2022 Jun 7.

DOI:10.1016/j.compbiomed.2022.105704

PMID:35690478

Abstract

Thermophilic proteins (TPPs) are important in protein biochemistry and in the development of new enzymes, so computational methods that identify them accurately and rapidly are urgently needed. Several computational methods have been developed for TPP identification; however, some limitations in performance and utility remain. In this study, we present a novel computational method, SAPPHIRE, that identifies TPPs more accurately using sequence information alone, without any need for structural information. We combined twelve feature encodings, each representing a different perspective, with six popular machine learning algorithms to train 72 baseline models and extract the key information of TPPs. The informative predicted probabilities from these baseline models were then mined and selected using a genetic algorithm in conjunction with a self-assessment-report approach. Finally, the meta-predictor, SAPPHIRE, was built and optimized using the optimal feature set. In the 10-fold cross-validation test, SAPPHIRE achieved superior predictive performance compared with several baseline models. Moreover, in the independent test, SAPPHIRE yielded an accuracy of 0.942 and a Matthews correlation coefficient of 0.884, which were 7.68% and 5.12% higher, respectively, than those of existing methods. The proposed approach is anticipated to facilitate large-scale identification of TPPs and accelerate their applications in the food industry. The code and datasets are available at https://github.com/plenoi/SAPPHIRE.
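
The stacking idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation (see the repository above for that). The two encodings (`aac`, `group_composition`), the two base learners (`centroid_prob`, `knn_prob`), the logistic-regression meta-learner, and the toy sequences are all assumptions chosen for illustration, standing in for the twelve encodings and six algorithms of the paper. The genetic-algorithm selection step is omitted.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Two toy sequence encodings (hypothetical stand-ins for the paper's twelve).
def aac(seq):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def group_composition(seq):
    """Fraction of residues in three illustrative physicochemical groups."""
    groups = ("DEKR", "AVLIMFWC", "STNQGPHY")  # charged, hydrophobic, other
    return [sum(seq.count(a) for a in g) / len(seq) for g in groups]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two toy base learners (hypothetical stand-ins for the paper's six algorithms).
def centroid_prob(train_X, train_y, x):
    """Score in [0, 1] from the relative distance to the two class centroids."""
    def centroid(label):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    d_pos, d_neg = distance(x, centroid(1)), distance(x, centroid(0))
    return d_neg / (d_pos + d_neg) if d_pos + d_neg > 0 else 0.5

def knn_prob(train_X, train_y, x, k=3):
    """Fraction of positive labels among the k nearest training samples."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: distance(p[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def out_of_fold_probs(X, y, learner, n_folds=4):
    """Predict each sample with a model that never saw it during training."""
    probs = [0.0] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        for i in range(fold, len(X), n_folds):
            probs[i] = learner(tr_X, tr_y, X[i])
    return probs

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            err = t - sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
            w = [wi + lr * err * zi for wi, zi in zip(w, z)]
            b += lr * err
    return w, b

# Made-up toy sequences; labels alternate 1 (thermophilic) / 0 (non-thermophilic).
sequences = [
    "MKEEKRELIEKAKRV", "MSTNQGASLTQSNG",
    "MEKRIEELKKEVER", "MAQSTGNLSQTGAS",
    "MRKEELIKKAEEIR", "MGSNTQAQSLGTNS",
    "MKKIEERLKEEAKK", "MTQGSSNALQTSGN",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Each (encoding, algorithm) pair is one baseline model: 2 x 2 = 4 here,
# 12 x 6 = 72 in the paper.
baselines = list(product([aac, group_composition], [centroid_prob, knn_prob]))
columns = [
    out_of_fold_probs([enc(s) for s in sequences], labels, learner)
    for enc, learner in baselines
]
# Meta-feature matrix: one row per sequence, one column per baseline model.
meta_X = [list(row) for row in zip(*columns)]

# The meta-predictor is trained on the baseline probabilities, not on raw features.
w, b = fit_logistic(meta_X, labels)
final_probs = [sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) for z in meta_X]
```

Here `final_probs` holds in-sample outputs of the meta-learner on the toy data, so it only demonstrates the data flow. A real evaluation would use cross-validation and a held-out independent test set, as the abstract describes.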
