Olivares-Gil Alicia, Barbero-Aparicio José A, Rodríguez Juan J, Díez-Pastor José F, García-Osorio César, Davari Mehdi D
Departamento de Ingeniería Informática, Universidad de Burgos, Avda. Cantabria s/n, Burgos, 09006, Spain.
Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120, Halle, Germany.
J Cheminform. 2025 May 31;17(1):88. doi: 10.1186/s13321-025-01029-w.
Protein fitness prediction plays a crucial role in advancing protein engineering. However, the combinatorial complexity of the protein sequence space and the limited availability of assay-labelled data hinder the efficient optimisation of protein properties. Data-driven strategies based on machine learning have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, we explore various ways of introducing the latent information present in evolutionarily related (homologous) sequences into the training process. We establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison using 19 datasets containing protein-fitness pairs. Our findings reveal that using the information present in homologous sequences can improve model performance, especially when very few labelled sequences are available. Specifically, the combination of a sequence encoding method based on Direct Coupling Analysis (DCA) with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial step towards more robust and reliable predictive models for protein engineering tasks. This advance has the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics.
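To make the idea of a semi-supervised wrapper concrete, the following is a minimal sketch of generic pseudo-labelling (self-training), in which unlabelled homologous sequences are gradually added to the training set with predicted fitness values. It is not the authors' implementation and is simpler than the Tri-Training and Co-Training regressors named in the abstract. As assumptions made purely for illustration, Hamming distance stands in for the sequence encodings (DCA, PAM250, UniRep, eUniRep), a k-nearest-neighbour average stands in for the SVM regressor, and all function names and toy sequences are hypothetical.

```python
from statistics import mean, pvariance


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_labels(seq, labelled, k):
    """Fitness values of the k labelled sequences closest to seq."""
    ranked = sorted(labelled, key=lambda pair: hamming(seq, pair[0]))
    return [fitness for _, fitness in ranked[:k]]


def predict(seq, labelled, k=3):
    """Predict fitness as the mean fitness of the k nearest labelled sequences."""
    return mean(knn_labels(seq, labelled, k))


def self_train(labelled, unlabelled, k=3, rounds=5):
    """Pseudo-label up to `rounds` unlabelled sequences, one per round.

    Each round picks the unlabelled sequence whose nearest neighbours agree
    most (lowest variance of their fitness values), assigns it the predicted
    fitness, and moves it into the labelled set.
    """
    labelled = list(labelled)  # copy so the caller's data is not modified
    pool = list(unlabelled)
    for _ in range(min(rounds, len(pool))):
        best = min(pool, key=lambda s: pvariance(knn_labels(s, labelled, k)))
        labelled.append((best, predict(best, labelled, k)))
        pool.remove(best)
    return labelled


# Toy data (hypothetical): a few assay-labelled variants and unlabelled homologues.
labelled = [("ACDE", 1.0), ("ACDF", 1.2), ("GCDE", 0.4), ("GHDE", 0.2)]
homologues = ["ACDG", "GHDF", "ACKE"]

augmented = self_train(labelled, homologues, k=3, rounds=5)
for seq, fitness in augmented:
    print(seq, round(fitness, 3))
```

The variance-based confidence rule is a deliberately simple choice; published co-training regressors typically use more careful criteria for deciding which pseudo-labels to trust.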