Sensitive remote homology search by local alignment of small positional embeddings from protein language models.

Affiliation

New England Biolabs Inc, Ipswich, United States.

Publication information

Elife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.

DOI:10.7554/eLife.91415

PMID:38488154

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10942778/

Abstract

Accurately detecting distant evolutionary relationships between proteins remains an ongoing challenge in bioinformatics. Search methods based on primary sequence struggle to accurately detect homology between sequences with less than 20% amino acid identity. Profile- and structure-based strategies extend sensitive search capabilities into this twilight zone of sequence similarity but require slow pre-processing steps. Recently, whole-protein and positional embeddings from deep neural networks have shown promise for providing sensitive sequence comparison and annotation at long evolutionary distances. Embeddings are generally faster to compute than profiles and predicted structures but still suffer several drawbacks related to the ability of whole-protein embeddings to discriminate domain-level homology, and the database size and search speed of methods using positional embeddings. In this work, we show that low-dimensionality positional embeddings can be used directly in speed-optimized local search algorithms. As a proof of concept, we use the ESM2 3B model to convert primary sequences directly into the 3D interaction (3Di) alphabet or amino acid profiles and use these embeddings as input to the highly optimized Foldseek, HMMER3, and HH-suite search algorithms. Our results suggest that positional embeddings as small as a single byte can provide sufficient information for dramatically improved sensitivity over amino acid sequence searches without sacrificing search speed.
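
The idea of aligning one-byte-per-position encodings can be illustrated with a minimal sketch. This is not the authors' pipeline and not the scoring used by Foldseek, HMMER3, or HH-suite: it assumes a hypothetical 20-letter state alphabet standing in for 3Di states, uses toy match/mismatch/gap scores instead of a learned substitution matrix, and runs a plain Smith-Waterman recurrence. The names `ALPHABET`, `encode`, and `local_align_score` are illustrative only.

```python
# Toy illustration only: hypothetical alphabet and scores, not Foldseek's 3Di matrix.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20 single-letter states


def encode(states: str) -> bytes:
    """Store one state per position as a single byte (values 0..19)."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    return bytes(index[ch] for ch in states)


def local_align_score(a: bytes, b: bytes,
                      match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    query = encode("AAACDEAAA")
    target = encode("WWCDEWW")
    print(len(query), "bytes for", len(query), "positions")
    print("local score:", local_align_score(query, target))
```

Because each position occupies one byte, the encoded database grows linearly with sequence length, which is the property the abstract contrasts with larger positional embeddings.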

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fa3/10942778/00e31e185c97/elife-91415-fig1.jpg

Similar articles

Sensitive remote homology search by local alignment of small positional embeddings from protein language models.

Elife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.

Leveraging protein language models for accurate multiple sequence alignments.

Genome Res. 2023 Jul;33(7):1145-1153. doi: 10.1101/gr.277675.123. Epub 2023 Jul 6.

xHMMER3x2: Utilizing HMMER3's speed and HMMER2's sensitivity and specificity in the glocal alignment mode for improved large-scale protein domain annotation.

Biol Direct. 2016 Nov 29;11(1):63. doi: 10.1186/s13062-016-0163-0.

Incorporating homologues into sequence embeddings for protein analysis.

J Bioinform Comput Biol. 2007 Jun;5(3):717-38. doi: 10.1142/s0219720007002734.

Improvements in viral gene annotation using large language models and soft alignments.

BMC Bioinformatics. 2024 Apr 25;25(1):165. doi: 10.1186/s12859-024-05779-6.

Fold homology detection using sequence fragment composition profiles of proteins.

Proteins. 2010 Oct;78(13):2745-56. doi: 10.1002/prot.22788.

Within the twilight zone: a sensitive profile-profile comparison tool based on information theory.

J Mol Biol. 2002 Feb 1;315(5):1257-75. doi: 10.1006/jmbi.2001.5293.

Assessing the role of evolutionary information for enhancing protein language model embeddings.

Sci Rep. 2024 Sep 5;14(1):20692. doi: 10.1038/s41598-024-71783-8.

Modeling aspects of the language of life through transfer-learning protein sequences.

BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases.

Bioinformatics. 2000 Nov;16(11):988-1002. doi: 10.1093/bioinformatics/16.11.988.

Cited by

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf357.

Predicting phage-host interaction via hyperbolic Poincaré graph embedding and large-scale protein language technique.

iScience. 2024 Dec 19;28(1):111647. doi: 10.1016/j.isci.2024.111647. eCollection 2025 Jan 17.

Major advances in protein function assignment by remote homolog detection with protein language models - A review.

Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27.

Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems.

Nucleic Acids Res. 2025 Jan 11;53(2). doi: 10.1093/nar/gkae1175.

High fitness paths can connect proteins with low sequence overlap.

ArXiv. 2024 Nov 13:arXiv:2411.09054v1.

High fitness paths can connect proteins with low sequence overlap.

bioRxiv. 2024 Nov 15:2024.11.13.623265. doi: 10.1101/2024.11.13.623265.

In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Bioinform Adv. 2024 Aug 17;4(1):vbae119. doi: 10.1093/bioadv/vbae119. eCollection 2024.

References

Protein embedding based alignment.

BMC Bioinformatics. 2024 Feb 28;25(1):85. doi: 10.1186/s12859-024-05699-5.

Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone.

Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btad786.

How AlphaFold2 shaped the structural coverage of the human transmembrane proteome.

Sci Rep. 2023 Nov 20;13(1):20283. doi: 10.1038/s41598-023-47204-7.

pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models.

Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad579.

Fast and accurate protein structure search with Foldseek.

Nat Biotechnol. 2024 Feb;42(2):243-246. doi: 10.1038/s41587-023-01773-0. Epub 2023 May 8.

Evolutionary-scale prediction of atomic-level protein structure with a language model.

Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.

ProteInfer, deep neural networks for protein functional inference.

Elife. 2023 Feb 27;12:e80942. doi: 10.7554/eLife.80942.

Improved global protein homolog detection with major gains in function identification.

Proc Natl Acad Sci U S A. 2023 Feb 28;120(9):e2211823120. doi: 10.1073/pnas.2211823120. Epub 2023 Feb 24.

MGnify: the microbiome sequence data analysis resource in 2023.

Nucleic Acids Res. 2023 Jan 6;51(D1):D753-D759. doi: 10.1093/nar/gkac1080.

Nearest neighbor search on embeddings rapidly identifies distant protein relations.

Front Bioinform. 2022 Nov 17;2:1033775. doi: 10.3389/fbinf.2022.1033775. eCollection 2022.
