Bilingual language model for protein sequence and structure.

Author information

Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Milot Mirdita, Martin Steinegger, Burkhard Rost

Affiliations

School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.

School of Biological Sciences, Seoul National University, 08826 Seoul, South Korea.

Publication information

NAR Genom Bioinform. 2024 Nov 15;6(4):lqae150. doi: 10.1093/nargab/lqae150. eCollection 2024 Dec.

Abstract

Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance for subsequent, structure-related prediction tasks, leading to three orders of magnitude speedup for deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.
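
As a rough illustration of what treating both modalities as token sequences for one translation model can look like, the sketch below formats an amino acid string and a 3Di string as space-separated tokens with a direction tag. The helper name `prepare_input`, the tag strings, the lower-case convention for 3Di states, and the mapping of rare residues to `X` are all assumptions made for illustration; the abstract does not specify any of them, and the example sequences are made up.

```python
import re

# Direction tags (assumed for this sketch; not stated in the abstract).
AA_TO_3DI = "<AA2fold>"      # amino acids -> 3Di
THREEDI_TO_AA = "<fold2AA>"  # 3Di -> amino acids


def prepare_input(seq: str, is_3di: bool) -> str:
    """Format a raw sequence as a tagged, space-separated token string."""
    if is_3di:
        # Assumption: 3Di states are lower-cased so that they do not
        # collide with the upper-case amino acid letters.
        tokens = seq.lower()
        prefix = THREEDI_TO_AA
    else:
        # Assumption: rare or ambiguous residues (U, Z, O, B) map to X.
        tokens = re.sub(r"[UZOB]", "X", seq.upper())
        prefix = AA_TO_3DI
    # One token per residue or structural state, separated by spaces.
    return prefix + " " + " ".join(tokens)


# Made-up toy sequences, for demonstration only.
print(prepare_input("MKTUAY", is_3di=False))
print(prepare_input("DVVLCP", is_3di=True))
```

Strings prepared this way would then be passed to a tokenizer and an encoder-decoder model; that step is omitted here because it depends on the specific released checkpoint.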
