Bikias Thomas, Stamkopoulos Evangelos, Reddy Sai T
Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.
Botnar Institute of Immune Engineering, Basel, Switzerland.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf381.
Protein language models (PLMs) have emerged as a useful resource for protein engineering applications. Transfer learning (TL) leverages pre-trained parameters either to extract features for training machine learning models or to adjust the weights of PLMs for novel tasks via fine-tuning (FT) through back-propagation. TL methods have shown potential for enhancing protein prediction performance when paired with PLMs; however, there is a notable lack of comparative analyses that benchmark TL methods applied to state-of-the-art PLMs, identify optimal strategies for transferring knowledge, and determine the most suitable approach for specific tasks. Here, we report PLMFit, a benchmarking study that combines three state-of-the-art PLMs (ESM2, ProGen2, ProteinBert) with three TL methods (feature extraction, low-rank adaptation, bottleneck adapters) across five protein engineering datasets. We conducted over 3150 in silico experiments, varying PLM sizes and layers, TL hyperparameters, and training procedures. Our experiments reveal three key findings: (i) utilizing only a partial fraction of a PLM for TL does not detrimentally impact performance, (ii) the choice between feature extraction (FE) and fine-tuning is primarily dictated by the amount and diversity of data, and (iii) FT is most effective when generalization is necessary and only limited data are available. We provide PLMFit as an open-source software package, serving as a valuable resource for the scientific community to facilitate the FE and FT of PLMs for various applications.
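To illustrate the two TL regimes compared in the abstract, the sketch below contrasts feature extraction (frozen PLM, pooled embeddings fed to a downstream model) with low-rank adaptation (LoRA) fine-tuning. This is a minimal, hedged example using the Hugging Face transformers and peft libraries with a small public ESM2 checkpoint; it is not the PLMFit API, and the checkpoint name, pooling choice, and LoRA settings (r, lora_alpha, target modules) are illustrative assumptions only.

import torch
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model

# Small ESM2 checkpoint chosen for illustration; larger sizes work the same way.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
plm = AutoModel.from_pretrained(model_name)

# Toy protein sequences (placeholders, not from the benchmarked datasets).
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MKTAYIAKQR"]

# --- TL via feature extraction (FE): PLM weights stay frozen ---
inputs = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = plm(**inputs).last_hidden_state      # shape: (batch, length, dim)
features = hidden.mean(dim=1)                     # mean-pool residue embeddings per sequence
# `features` can now train any downstream regressor or classifier.

# --- TL via low-rank adaptation (LoRA) fine-tuning: only small adapter matrices are trained ---
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["query", "value"])  # attention projections in ESM2
plm_lora = get_peft_model(plm, lora_cfg)
plm_lora.print_trainable_parameters()             # shows the small trainable fraction

In the FE path the pooled embeddings are computed once and reused, which is cheap but fixed; in the LoRA path gradients flow through the adapters during task training, trading compute for task-specific adaptation, which mirrors the FE-versus-FT trade-off the study benchmarks.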