Sensitive remote homology search by local alignment of small positional embeddings from protein language models.

Affiliation

New England Biolabs Inc, Ipswich, United States.

Publication information

Elife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.

Abstract

Accurately detecting distant evolutionary relationships between proteins remains an ongoing challenge in bioinformatics. Search methods based on primary sequence struggle to accurately detect homology between sequences with less than 20% amino acid identity. Profile- and structure-based strategies extend sensitive search capabilities into this twilight zone of sequence similarity but require slow pre-processing steps. Recently, whole-protein and positional embeddings from deep neural networks have shown promise for providing sensitive sequence comparison and annotation at long evolutionary distances. Embeddings are generally faster to compute than profiles and predicted structures but still suffer several drawbacks related to the ability of whole-protein embeddings to discriminate domain-level homology, and the database size and search speed of methods using positional embeddings. In this work, we show that low-dimensionality positional embeddings can be used directly in speed-optimized local search algorithms. As a proof of concept, we use the ESM2 3B model to convert primary sequences directly into the 3D interaction (3Di) alphabet or amino acid profiles and use these embeddings as input to the highly optimized Foldseek, HMMER3, and HH-suite search algorithms. Our results suggest that positional embeddings as small as a single byte can provide sufficient information for dramatically improved sensitivity over amino acid sequence searches without sacrificing search speed.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5fa3/10942778/00e31e185c97/elife-91415-fig1.jpg
