
In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Authors

Kabir Anowarul, Moldwin Asher, Bromberg Yana, Shehu Amarda

Affiliations

Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.

Department of Computer Science, Emory University, Atlanta, GA 30307, United States.

Publication information

Bioinform Adv. 2024 Aug 17;4(1):vbae119. doi: 10.1093/bioadv/vbae119. eCollection 2024.

DOI:10.1093/bioadv/vbae119
PMID:39183802
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11344590/
Abstract

MOTIVATION

Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.

RESULTS

We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
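The evaluation idea described above (zero-shot remote homology prediction tested at progressively lower sequence identities) can be illustrated with a minimal sketch. This is not the authors' pipeline. The function names, the nearest-neighbor decision rule, the cosine similarity measure, the toy two-dimensional "embeddings", and the fold labels are all assumptions made for illustration. Sequences are assumed to be pre-aligned to equal length so that identity can be counted position by position.

```python
import math

def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def zero_shot_top1_accuracy(entries, max_identity):
    """Nearest-neighbor label transfer with no training step.

    entries: list of (aligned_sequence, embedding, label) tuples.
    For each query, only candidates whose identity to the query is
    <= max_identity are kept; the label of the most cosine-similar
    candidate is the prediction. Returns None if no query has candidates.
    """
    correct = total = 0
    for i, (q_seq, q_emb, q_label) in enumerate(entries):
        candidates = [
            (cosine(q_emb, emb), label)
            for j, (seq, emb, label) in enumerate(entries)
            if j != i and percent_identity(q_seq, seq) <= max_identity
        ]
        if not candidates:
            continue
        best_label = max(candidates, key=lambda c: c[0])[1]
        correct += best_label == q_label
        total += 1
    return correct / total if total else None

# Toy data only (hypothetical): in practice the embeddings would come from a
# protein language model and the labels from a structural classification.
entries = [
    ("ACDEFGHIKL", [1.0, 0.1], "foldA"),
    ("ACDEFGHIKV", [0.9, 0.2], "foldA"),
    ("MNPQRSTVWY", [0.8, 0.3], "foldA"),
    ("MNPQRSTVWA", [0.1, 1.0], "foldB"),
    ("LKIHGFEDCA", [0.2, 0.9], "foldB"),
]

# Sweep from a permissive identity cutoff down to a very strict one.
for cutoff in (100, 30, 5):
    print(cutoff, zero_shot_top1_accuracy(entries, cutoff))
```

Lowering `max_identity` removes the easy, high-identity neighbors from the candidate pool, so a query can only be matched correctly if its embedding is close to that of a low-identity protein with the same label. The numbers printed for this toy set say nothing about real model performance.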

AVAILABILITY AND IMPLEMENTATION

We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd7e/11344590/89d022c0f784/vbae119f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd7e/11344590/3083b958a778/vbae119f2.jpg
