

S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure.

Author Information

Wang Duolin, Pourmirzaei Mahdi, Abbas Usman L, Zeng Shuai, Manshour Negin, Esmaili Farzaneh, Poudel Biplab, Jiang Yuexu, Shao Qing, Chen Jin, Xu Dong

Affiliations

Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, 65211, USA.

Chemical & Materials Engineering, University of Kentucky, Lexington, KY, 40506, USA.

Publication Information

Adv Sci (Weinh). 2025 Feb;12(5):e2404212. doi: 10.1002/advs.202404212. Epub 2024 Dec 12.

Abstract

Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) show excellent potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with desired functions. The prediction and design capacity of PLMs relies on the representation gained from protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts their prediction capacity in various applications, especially those heavily dependent on 3D structures. To address this issue, S-PLM is introduced as a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies a Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into the sequence-based embedding from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S-PLM to diverse downstream protein prediction tasks. The results demonstrate S-PLM's superior performance over sequence-only PLMs on all protein clustering and classification tasks, achieving performance competitive with state-of-the-art methods that require both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
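The multi-view contrastive objective described in the abstract, which aligns sequence embeddings and structure embeddings in a shared latent space, can be sketched as a symmetric CLIP-style InfoNCE loss over matched sequence/structure pairs. The NumPy function below is a minimal illustration of that general technique, not the authors' implementation; the function name, temperature value, and embedding shapes are assumptions for the sketch.

```python
import numpy as np

def contrastive_alignment_loss(seq_emb, struct_emb, temperature=0.1):
    """Symmetric InfoNCE loss between two views of the same proteins.

    seq_emb, struct_emb: arrays of shape (batch, dim), where row i of each
    array embeds the same protein (one from its sequence, one from its
    AlphaFold-predicted structure in the S-PLM setting).
    """
    # L2-normalize both views so the dot product is cosine similarity.
    s = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    t = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)

    logits = s @ t.T / temperature      # (batch, batch) similarity matrix
    idx = np.arange(len(s))             # matched pairs lie on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target.
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()

    # Average the sequence->structure and structure->sequence directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each protein's sequence and structure embeddings together while pushing apart embeddings of different proteins in the batch, which is what lets the fused representation become structure-aware.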


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3cd1/11791933/ae70fd1205f1/ADVS-12-2404212-g001.jpg
