SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins.

Affiliations

Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand.

Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.

Publication information

Comput Biol Med. 2022 Jul;146:105704. doi: 10.1016/j.compbiomed.2022.105704. Epub 2022 Jun 7.

DOI: 10.1016/j.compbiomed.2022.105704
PMID: 35690478
Abstract

Thermophilic proteins (TPPs) are important in protein biochemistry and in the development of new enzymes, so computational methods that identify them accurately and rapidly are urgently needed. Several computational methods have been developed for TPP identification; however, some limitations in performance and utility remain. In this study, we present a novel computational method, SAPPHIRE, that identifies TPPs more accurately using only sequence information, with no need for structural information. We combined twelve feature encodings representing different perspectives with six popular machine learning algorithms to train 72 baseline models (12 × 6) and extract the key information of TPPs. The informative predicted probabilities from the baseline models were then mined and selected using a genetic algorithm in conjunction with a self-assessment-report approach. Finally, the meta-predictor, SAPPHIRE, was built and optimized using the resulting optimal feature set. In 10-fold cross-validation, SAPPHIRE outperformed several baseline models. Moreover, on the independent test, SAPPHIRE achieved an accuracy of 0.942 and a Matthews correlation coefficient of 0.884, which were 7.68% and 5.12% higher, respectively, than those of existing methods. The proposed approach is anticipated to facilitate large-scale identification of TPPs and accelerate their applications in the food industry. The code and datasets are available at https://github.com/plenoi/SAPPHIRE.
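
The stacking workflow described in the abstract can be sketched in standard-library Python. This is an illustrative sketch only, not the authors' implementation. It assumes two encodings (amino acid composition and dipeptide composition) and two toy learners (nearest centroid and k-nearest neighbours) in place of the twelve encodings and six algorithms, trains on synthetic sequences, and omits the genetic-algorithm selection step. All function names are hypothetical.

```python
import math
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 ordered residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid_model(X, y):
    """Toy learner 1: score by relative distance to the two class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == 0])
    def predict(x):
        dp, dn = dist(x, pos), dist(x, neg)
        return dn / (dp + dn) if dp + dn > 0 else 0.5
    return predict

def knn_model(X, y, k=3):
    """Toy learner 2: fraction of positives among the k nearest neighbours."""
    def predict(x):
        nearest = sorted(zip(X, y), key=lambda p: dist(x, p[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict

def out_of_fold(train_fn, X, y, n_folds=5):
    """Predict each sample with a model that never saw it during training."""
    probs = [0.0] * len(X)
    for f in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != f]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        for i in range(f, len(X), n_folds):
            probs[i] = model(X[i])
    return probs

def predict_logistic(w, z):
    return 1 / (1 + math.exp(-(w[-1] + sum(wi * zi for wi, zi in zip(w, z)))))

def fit_logistic(Z, y, lr=0.5, epochs=200):
    """Meta-learner: logistic regression trained by plain SGD (last weight = bias)."""
    w = [0.0] * (len(Z[0]) + 1)
    for _ in range(epochs):
        for z, t in zip(Z, y):
            err = predict_logistic(w, z) - t
            for j, zj in enumerate(z):
                w[j] -= lr * err * zj
            w[-1] -= lr * err
    return w

def accuracy_and_mcc(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp + tn) / len(y_true), ((tp * tn - fp * fn) / denom if denom else 0.0)

# Synthetic toy data; the residue biases are arbitrary, chosen only to separate the classes.
rng = random.Random(0)
def toy_seq(favoured):
    return "".join(rng.choice(favoured * 4 + AMINO_ACIDS) for _ in range(60))
seqs = [toy_seq("EKRIV") if i % 2 else toy_seq("QNSTA") for i in range(40)]
labels = [i % 2 for i in range(40)]

# Baseline layer: every (encoding, learner) pair contributes one probability column.
encodings = {"AAC": aac, "DPC": dpc}
learners = {"centroid": centroid_model, "knn": knn_model}
columns = {
    (e, l): out_of_fold(train_fn, [enc(s) for s in seqs], labels)
    for (e, enc), (l, train_fn) in product(encodings.items(), learners.items())
}
meta_X = [list(row) for row in zip(*columns.values())]  # 40 samples x 4 probabilities

# Meta layer: train on the stacked probabilities (no column selection in this sketch).
weights = fit_logistic(meta_X, labels)
preds = [int(predict_logistic(weights, z) >= 0.5) for z in meta_X]
acc, mcc = accuracy_and_mcc(labels, preds)
```

Here `meta_X` plays the role of the predicted-probability features: with twelve encodings and six algorithms it would have 72 columns, from which a selection step would keep an informative subset before the meta-learner is trained. The accuracy and Matthews correlation coefficient computed at the end are measured on the toy training data and say nothing about the published results.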

Similar articles

1
SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins.
Comput Biol Med. 2022 Jul;146:105704. doi: 10.1016/j.compbiomed.2022.105704. Epub 2022 Jun 7.
2
StackDPPIV: A novel computational approach for accurate prediction of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides.
Methods. 2022 Aug;204:189-198. doi: 10.1016/j.ymeth.2021.12.001. Epub 2021 Dec 6.
3
Empirical comparison and analysis of machine learning-based predictors for predicting and analyzing of thermophilic proteins.
EXCLI J. 2022 Mar 2;21:554-570. doi: 10.17179/excli2022-4723. eCollection 2022.
4
A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides.
Sci Rep. 2021 Dec 10;11(1):23782. doi: 10.1038/s41598-021-03293-w.
5
SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins.
Sci Rep. 2022 Mar 8;12(1):4106. doi: 10.1038/s41598-022-08173-5.
6
PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins.
Bioinformatics. 2020 Feb 1;36(3):704-712. doi: 10.1093/bioinformatics/btz629.
7
StackTTCA: a stacking ensemble learning-based framework for accurate and high-throughput identification of tumor T cell antigens.
BMC Bioinformatics. 2023 Jul 28;24(1):301. doi: 10.1186/s12859-023-05421-x.
8
StackPR is a new computational approach for large-scale identification of progesterone receptor antagonists using the stacking strategy.
Sci Rep. 2022 Sep 30;12(1):16435. doi: 10.1038/s41598-022-20143-5.
9
STALLION: a stacking-based ensemble learning framework for prokaryotic lysine acetylation site prediction.
Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab376.
10
NEPTUNE: A novel computational approach for accurate and large-scale identification of tumor homing peptides.
Comput Biol Med. 2022 Sep;148:105700. doi: 10.1016/j.compbiomed.2022.105700. Epub 2022 Jun 7.

Cited by

1
Prediction and design of thermostable proteins with a desired melting temperature.
Sci Rep. 2025 May 14;15(1):16683. doi: 10.1038/s41598-025-98667-9.
2
M3S-GRPred: a novel ensemble learning approach for the interpretable prediction of glucocorticoid receptor antagonists using a multi-step stacking strategy.
BMC Bioinformatics. 2025 Apr 30;26(1):117. doi: 10.1186/s12859-025-06132-1.
3
Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model.
Commun Biol. 2024 Dec 29;7(1):1709. doi: 10.1038/s42003-024-07436-3.
4
Enhancing the erythritol production of Yarrowia lipolytica by high-throughput screening based on highly sensitive artificial sensor and anchor protein cwp2.
J Ind Microbiol Biotechnol. 2024 Jan 9;51. doi: 10.1093/jimb/kuae045.
5
A novel meta learning based stacked approach for diagnosis of thyroid syndrome.
PLoS One. 2024 Nov 1;19(11):e0312313. doi: 10.1371/journal.pone.0312313. eCollection 2024.
6
MetaCGRP is a high-precision meta-model for large-scale identification of CGRP inhibitors using multi-view information.
Sci Rep. 2024 Oct 21;14(1):24764. doi: 10.1038/s41598-024-75487-x.
7
Guiding questions to avoid data leakage in biological machine learning applications.
Nat Methods. 2024 Aug;21(8):1444-1453. doi: 10.1038/s41592-024-02362-y. Epub 2024 Aug 9.
8
TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms.
Bioinform Adv. 2024 Jul 13;4(1):vbae103. doi: 10.1093/bioadv/vbae103. eCollection 2024.
9
Long extrachromosomal circular DNA identification by fusing sequence-derived features of physicochemical properties and nucleotide distribution patterns.
Sci Rep. 2024 Apr 24;14(1):9466. doi: 10.1038/s41598-024-57457-5.
10
TemStaPro: protein thermostability prediction using sequence representations from protein language models.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae157.