Medium-sized protein language models perform well at transfer learning on realistic datasets.

Authors

Vieira Luiz C, Handojo Morgan L, Wilke Claus O

Affiliation

Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA.

Publication information

Sci Rep. 2025 Jul 1;15(1):21400. doi: 10.1038/s41598-025-05674-x.

DOI:10.1038/s41598-025-05674-x

PMID:40594749

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12217344/

Abstract

Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15-billion-parameter ESM-2 model, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various ESM-style models across multiple biological datasets to assess the impact of model size on transfer learning via feature extraction. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts, ESM-2 15B and ESM C 6B, despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
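
The "mean embeddings" mentioned in the abstract can be illustrated with a minimal sketch. It assumes each protein has already been turned into one vector per residue by some pLM. The function name mean_pool and the toy numbers are hypothetical and are not taken from the paper. The point is only that averaging over the sequence axis gives every protein a feature vector of the same length, whatever its sequence length.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (L x D) into one fixed-length vector (D)."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must share one dimension")
    n_residues = len(residue_embeddings)
    # Average each embedding dimension over all residues in the sequence.
    return [
        sum(vec[j] for vec in residue_embeddings) / n_residues
        for j in range(dim)
    ]


# Toy stand-ins for pLM output (hypothetical values, D = 3):
# a 2-residue protein and a 3-residue protein.
short_protein = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_protein = [[0.0, 0.0, 6.0], [2.0, 2.0, 0.0], [4.0, 4.0, 0.0]]

# Both proteins now map to vectors of length 3, usable as features
# for a downstream regression or classification model.
features = [mean_pool(p) for p in (short_protein, long_protein)]
```

In a real workflow the pooled vectors would serve as inputs to a downstream predictor; which predictor the authors used is not stated in the abstract.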
