
In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Author information

Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu

Affiliations

Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.

Department of Computer Science, Emory University, Atlanta, GA 30307, United States.

Publication information

Bioinform Adv. 2024 Aug 17;4(1):vbae119. doi: 10.1093/bioadv/vbae119. eCollection 2024.

Abstract

MOTIVATION

Protein language models based on the transformer architecture are steadily improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.

RESULTS

We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
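The zero-shot protocol described here can be pictured as nearest-neighbor retrieval in embedding space: embed each sequence with a pretrained protein language model, then rank candidate homologs by cosine similarity to the query. Below is a minimal sketch, assuming the fair-esm package and the 650M-parameter ESM-2 checkpoint; the paper evaluates a range of models, and its exact pooling and scoring choices may differ. The sequences shown are hypothetical placeholders.

```python
# Zero-shot remote-homology scoring via protein language model embeddings.
# Assumptions: the fair-esm package (pip install fair-esm) and the ESM-2
# 650M checkpoint; the paper's exact models and scoring may differ.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed(sequences):
    """Mean-pool final-layer residue embeddings into one vector per sequence."""
    batch = [(f"seq{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(batch)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]  # [batch, tokens, 1280]
    # Positions 1..len(s) hold residue embeddings (0 is BOS; padding/EOS follow).
    return torch.stack([reps[i, 1:len(s) + 1].mean(0)
                        for i, (_, s) in enumerate(batch)])

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"        # hypothetical query
candidates = ["MKSAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # hypothetical candidates
              "GSHMLEDPVDAFKEAFSLFDKDGDGTITTKELG"]

vecs = embed([query] + candidates)
# Zero-shot: no fine-tuning; higher cosine similarity to the query is read
# as stronger evidence that the candidate is a (remote) homolog.
scores = torch.nn.functional.cosine_similarity(vecs[:1], vecs[1:])
print(scores.tolist())
```

Testing "at progressively lower sequence identities" also requires a pairwise-identity estimate for each query-candidate pair, so that pairs can be binned into identity ranges. A sketch with Biopython's PairwiseAligner follows (assuming Biopython >= 1.80, and noting that identity conventions vary, e.g. normalizing by the shorter sequence versus by alignment length):

```python
# Percent-identity estimate for binning pairs into identity ranges, e.g. the
# "twilight zone" (commonly below ~25-35% identity). Assumes Biopython >= 1.80.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

def percent_identity(seq_a, seq_b):
    aln = aligner.align(seq_a, seq_b)[0]
    row_a, row_b = str(aln[0]), str(aln[1])  # aligned rows, gaps as '-'
    matches = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    return 100.0 * matches / min(len(seq_a), len(seq_b))  # one common convention
```

Pairs stratified this way can then be scored both by embedding similarity and by alignment-based baselines (e.g. BLAST bit scores), which is the kind of comparison against traditional sequence alignment methods the abstract reports.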

AVAILABILITY AND IMPLEMENTATION

We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd7e/11344590/89d022c0f784/vbae119f1.jpg
