Center for Data Science, New York University, New York, NY, USA.
Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
Nat Biotechnol. 2024 Jun;42(6):975-985. doi: 10.1038/s41587-023-01917-2. Epub 2023 Sep 7.
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
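As a rough illustration of the two ideas named in the abstract, the sketch below first computes a TM-score from per-residue distances using the standard published definition (for a fixed superposition, without the maximization over superpositions that a full structural aligner performs). It then ranks database entries against a query by a predicted similarity score. The abstract does not describe the model architecture, so the ranking part rests on assumptions: each sequence is taken to be already embedded as a fixed-length vector, and cosine similarity between vectors stands in for the predicted TM-score. The function names (`tm_d0`, `tm_score`, `cosine`, `rank_database`), the toy vectors, and the 0.5 Å floor for short chains are hypothetical choices for this sketch, not the authors' code.

```python
import math


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (angstroms) from the standard TM-score definition."""
    # Assumption for this sketch: floor d0 at 0.5 so very short chains do not
    # produce a zero or negative scale.
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances: distance (angstroms) between each aligned residue pair.
    target_length: length of the protein used for normalization.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_database(query_vec, database):
    """Rank database entries by similarity to the query, highest first.

    database: mapping from a protein name to its (hypothetical) embedding vector.
    The cosine value is used here as a stand-in for a predicted TM-score.
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors only; real embeddings would come from a trained model.
    toy_db = {"protA": [1.0, 0.0], "protB": [0.0, 1.0], "protC": [1.0, 1.0]}
    for name, score in rank_database([1.0, 0.0], toy_db):
        print(name, round(score, 3))
```

In this setup, the expensive step (embedding each database sequence) is done once, and each query then needs only vector comparisons, which is what makes searching large sequence databases practical without solving any structures.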