Semi-supervised prediction of protein fitness for data-driven protein engineering.

Authors

Olivares-Gil Alicia, Barbero-Aparicio José A, Rodríguez Juan J, Díez-Pastor José F, García-Osorio César, Davari Mehdi D

Affiliations

Departamento de Ingeniería Informática, Universidad De Burgos, Avda. Cantabria s/n, Burgos, 09006, Spain.

Department of Bioorganic Chemistry, Leibniz Institute Of Plant Biochemistry, Winberg 3, 06120, Halle, Germany.

Publication information

J Cheminform. 2025 May 31;17(1):88. doi: 10.1186/s13321-025-01029-w.

DOI: 10.1186/s13321-025-01029-w
PMID: 40450362
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12126886/

Abstract

Protein fitness prediction plays a crucial role in the advancement of protein engineering endeavours. However, the combinatorial complexity of the protein sequence space and the limited availability of assay-labelled data hinder the efficient optimization of protein properties. Data-driven strategies utilizing machine learning methods have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, in this work, we explore various ways of introducing the latent information present in evolutionarily related sequences (homologous sequences) into the training process. To do so, we establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison using 19 datasets containing protein-fitness pairs. Our findings reveal that using the information present in the homologous sequences can improve the performance of the models, especially when the number of available labelled sequences is considerably low. Specifically, the combination of a sequence encoding method based on Direct Coupling Analysis (DCA), with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor, outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial leap towards more robust and reliable predictive models for protein engineering tasks. This advancement holds the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics.
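
The idea of drawing information from unlabelled homologous sequences before any supervised training can be illustrated with a minimal sketch. It is not the DCA encoding evaluated in the paper: DCA also fits pairwise couplings between positions, whereas this sketch assumes a much simpler site-independent model built from per-position residue frequencies. The function names (`site_frequencies`, `encode`, `evolutionary_score`) and the toy sequences are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(homologs, pseudocount=1.0):
    """Per-position residue frequencies from aligned, equal-length homologs.

    Assumption: a site-independent model, i.e. no pairwise couplings
    (a deliberate simplification compared with DCA).
    """
    length = len(homologs[0])
    if any(len(seq) != length for seq in homologs):
        raise ValueError("homologous sequences must be aligned to equal length")
    freqs = []
    for pos in range(length):
        # Gap characters and non-standard residues are ignored.
        counts = Counter(seq[pos] for seq in homologs if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def encode(variant, wild_type, freqs):
    """Per-position log-odds of the variant residue versus the wild-type residue."""
    if not (len(variant) == len(wild_type) == len(freqs)):
        raise ValueError("variant, wild type and frequency table must share a length")
    return [
        math.log(freqs[i][v] / freqs[i][w])
        for i, (v, w) in enumerate(zip(variant, wild_type))
    ]


def evolutionary_score(variant, wild_type, freqs):
    """Unsupervised score: sum of the per-position log-odds."""
    return sum(encode(variant, wild_type, freqs))


# Toy, invented alignment of unlabelled homologs (no fitness values needed).
homologs = ["MKVL", "MKIL", "MRVL", "MKVL", "MKVI"]
wild_type = "MKVL"
freqs = site_frequencies(homologs)

features = encode("MRVL", wild_type, freqs)            # K -> R at position 2
seen = evolutionary_score("MKVI", wild_type, freqs)    # substitution seen in homologs
unseen = evolutionary_score("MKVW", wild_type, freqs)  # substitution never seen
```

In this sketch the feature vector could be passed to any supervised regressor trained on the few labelled variants, while the summed score is usable on its own when no labels are available. A substitution that appears among the homologs scores higher than one that never appears, which is the kind of prior that a hybrid approach combines with assay-labelled data.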

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/150f/12126886/a925f36b1423/13321_2025_1029_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/150f/12126886/d331b4be30b0/13321_2025_1029_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/150f/12126886/dc80e4ae3a87/13321_2025_1029_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/150f/12126886/6fb71aace1dd/13321_2025_1029_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/150f/12126886/46488fd5843e/13321_2025_1029_Fig5_HTML.jpg

Similar articles

1. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort.
J Chem Inf Model. 2024 Aug 26;64(16):6350-6360. doi: 10.1021/acs.jcim.4c00704. Epub 2024 Aug 1.
2. Cluster learning-assisted directed evolution.
Nat Comput Sci. 2021 Dec;1(12):809-818. doi: 10.1038/s43588-021-00168-y. Epub 2021 Dec 9.
3. Semi-supervised semantic segmentation of cell nuclei with diffusion model and collaborative learning.
J Med Imaging (Bellingham). 2025 Nov;12(6):061403. doi: 10.1117/1.JMI.12.6.061403. Epub 2025 Mar 20.
4. Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.
Int J Mol Sci. 2023 Nov 18;24(22):16496. doi: 10.3390/ijms242216496.
5. Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods.
Pharmaceutics. 2023 Apr 25;15(5):1337. doi: 10.3390/pharmaceutics15051337.
6. A multi-task and multi-channel convolutional neural network for semi-supervised neonatal artefact detection.
J Neural Eng. 2023 Mar 14;20(2). doi: 10.1088/1741-2552/acbc4b.
7. Semi-supervised abdominal multi-organ segmentation by object-redrawing.
Med Phys. 2024 Nov;51(11):8334-8347. doi: 10.1002/mp.17364. Epub 2024 Aug 21.
8. PolypMixNet: Enhancing semi-supervised polyp segmentation with polyp-aware augmentation.
Comput Biol Med. 2024 Mar;170:108006. doi: 10.1016/j.compbiomed.2024.108006. Epub 2024 Jan 15.
9. Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors.
BMC Bioinformatics. 2022 Jun 8;23(1):223. doi: 10.1186/s12859-022-04764-1.
