New England Biolabs Inc, Ipswich, United States.
eLife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.
Detecting distant evolutionary relationships between proteins remains an ongoing challenge in bioinformatics. Search methods based on primary sequence struggle to accurately detect homology between sequences with less than 20% amino acid identity. Profile- and structure-based strategies extend sensitive search capabilities into this twilight zone of sequence similarity but require slow pre-processing steps. Recently, whole-protein and positional embeddings from deep neural networks have shown promise for providing sensitive sequence comparison and annotation at long evolutionary distances. Embeddings are generally faster to compute than profiles and predicted structures but still suffer from several drawbacks: whole-protein embeddings have limited ability to discriminate domain-level homology, and methods using positional embeddings are constrained by database size and search speed. In this work, we show that low-dimensionality positional embeddings can be used directly in speed-optimized local search algorithms. As a proof of concept, we use the ESM2 3B model to convert primary sequences directly into the 3D interaction (3Di) alphabet or amino acid profiles and use these embeddings as input to the highly optimized Foldseek, HMMER3, and HH-suite search algorithms. Our results suggest that positional embeddings as small as a single byte can provide sufficient information for dramatically improved sensitivity over amino acid sequence searches without sacrificing search speed.
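To make the idea of a single-byte positional embedding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a model has already produced one row of scores per residue over a 20-state alphabet; the alphabet string, the function names, and the match, mismatch, and gap scores are all hypothetical choices made for illustration. Each position is reduced to its highest-scoring state, and the resulting strings are compared with a plain Smith-Waterman local alignment, standing in for the optimized search tools named in the abstract.

```python
# Assumed 20-state alphabet, one byte per position (illustrative encoding only).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def logits_to_states(logits, alphabet=ALPHABET):
    """Reduce per-position scores to one letter each by taking the argmax."""
    states = []
    for row in logits:
        best = max(range(len(row)), key=row.__getitem__)
        states.append(alphabet[best])
    return "".join(states)


def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local score with a linear gap penalty.

    The scoring values are toy assumptions; a real search tool would use a
    trained substitution matrix and affine gap costs.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Toy scores over a 3-letter alphabet, not real model output.
    toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0]]
    query_states = logits_to_states(toy_logits, alphabet="ACD")
    print(query_states, local_alignment_score(query_states, "DDCADDD"))
```

Because each position collapses to one letter, the database is no larger than an amino acid sequence database, and existing string-based local alignment code can be reused unchanged.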