A Benchmarking Platform for Assessing Protein Language Models on Function-Related Prediction Tasks.

Author information

Çevrim Elif, Yiğit Melih Gökay, Ulusoy Erva, Yılmaz Ardan, Doğan Tunca

Affiliations

Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey.

Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey.

Publication information

Methods Mol Biol. 2025;2947:241-268. doi: 10.1007/978-1-0716-4662-5_14.

Abstract

Proteins play a crucial role in almost all biological processes, serving as the building blocks of life and mediating various cellular functions, from enzymatic reactions to immune responses. Accurate annotation of protein functions is essential for advancing our understanding of biological systems and developing innovative biotechnological applications and therapeutic strategies. To predict protein function, researchers primarily rely on classical homology-based methods, which use evolutionary relationships, and increasingly on machine learning (ML) approaches. Lately, protein language models (PLMs) have gained prominence; these models leverage specialized deep learning architectures to effectively capture intricate relationships between sequence, structure, and function. We recently conducted a comprehensive benchmarking study to evaluate diverse protein representations (i.e., classical approaches and PLMs) and discuss their trade-offs. The current work introduces the Protein Representation Benchmark (PROBE) tool, a benchmarking framework designed to evaluate protein representations on function-related prediction tasks. Here, we provide a detailed protocol for running the framework via the GitHub repository and accessing our newly developed user-friendly web service. PROBE encompasses four core tasks: semantic similarity inference, ontology-based function prediction, drug target family classification, and protein-protein binding affinity estimation. We demonstrate PROBE's usage through a new use case evaluating ESM2 and three recent multimodal PLMs (ESM3, ProstT5, and SaProt), highlighting their ability to integrate diverse data types, including sequence and structural information. This study underscores the potential of protein language models in advancing protein function prediction and serves as a valuable tool for both PLM developers and users.
