A Transferability-Based Method for Evaluating the Protein Representation Learning.

Authors

Hu Fan, Zhang Weihong, Huang Huazhen, Li Wang, Li Yang, Yin Peng

Publication information

IEEE J Biomed Health Inform. 2024 May;28(5):3158-3166. doi: 10.1109/JBHI.2024.3370680. Epub 2024 May 6.

Abstract

Self-supervised pre-trained language models have recently risen as a powerful approach in learning protein representations, showing exceptional effectiveness in various biological tasks, such as drug discovery. Amidst the evolving trend in protein language model development, there is an observable shift towards employing large-scale multimodal and multitask models. However, the predominant reliance on empirical assessments using specific benchmark datasets for evaluating these models raises concerns about the comprehensiveness and efficiency of current evaluation methods. Addressing this gap, our study introduces a novel quantitative approach for estimating the performance of transferring multi-task pre-trained protein representations to downstream tasks. This transferability-based method is designed to quantify the similarities in latent space distributions between pre-trained features and those fine-tuned for downstream tasks. It encompasses a broad spectrum, covering multiple domains and a variety of heterogeneous tasks. To validate this method, we constructed a diverse set of protein-specific pre-training tasks. The resulting protein representations were then evaluated across several downstream biological tasks. Our experimental results demonstrate a robust correlation between the transferability scores obtained using our method and the actual transfer performance observed. This significant correlation highlights the potential of our method as a more comprehensive and efficient tool for evaluating protein representation learning.
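The abstract does not give the formula behind the transferability score, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes linear centered kernel alignment (CKA) as a stand-in measure of similarity between pre-trained and fine-tuned feature matrices, and Spearman rank correlation as the check that scores track actual transfer performance. The function names, the synthetic features, and the choice of CKA are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    Assumption: CKA is a stand-in similarity measure, not the paper's score.
    x: (n_samples, d1) pre-trained features; y: (n_samples, d2) fine-tuned features.
    """
    # Center each feature dimension so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def spearman(a, b):
    """Spearman rank correlation for sequences without tied values."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def demo():
    # Synthetic stand-ins: three hypothetical pre-training tasks whose features
    # drift from the fine-tuned features by increasing amounts of noise.
    rng = np.random.default_rng(0)
    fine_tuned = rng.normal(size=(200, 16))
    scores = []
    for noise in (0.1, 1.0, 5.0):
        pre_trained = fine_tuned + noise * rng.normal(size=fine_tuned.shape)
        scores.append(linear_cka(pre_trained, fine_tuned))
    # Hypothetical downstream accuracies, invented purely for the demo.
    accuracies = [0.91, 0.78, 0.55]
    print("similarity scores:", scores)
    print("rank correlation with accuracy:", spearman(scores, accuracies))


if __name__ == "__main__":
    demo()
```

In a setup like this, a rank correlation close to 1 between scores and downstream results would be the kind of evidence the abstract describes, though the paper's own score and experimental protocol may differ substantially.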
