

Protein domain embeddings for fast and accurate similarity search.

Affiliations

Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA.


Publication information

Genome Res. 2024 Oct 11;34(9):1434-1444. doi: 10.1101/gr.279127.124.

DOI:10.1101/gr.279127.124
PMID:39237301
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11529836/
Abstract

Recently developed protein language models have enabled a variety of applications through the contextual protein embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transform (DCT) to matrices of residue embeddings. Such protein-level embeddings have been used to enable fast searches for similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and between those that share only local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
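
The idea of applying the DCT to a matrix of residue embeddings so that domains of different lengths map to vectors of one fixed size can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dct2` and `domain_fingerprint`, the parameters `keep_rows` and `keep_cols`, and the toy input values are all hypothetical, and the sketch omits any normalization or quantization a real pipeline might apply. It takes a 2-D DCT of an L x D embedding matrix and keeps only the low-frequency corner.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence of numbers."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def domain_fingerprint(residue_embeddings, keep_rows=3, keep_cols=4):
    """Compress an L x D matrix (L residues, D embedding dimensions) into
    keep_rows * keep_cols low-frequency 2-D DCT coefficients.

    Hypothetical helper for illustration only.
    """
    n_res = len(residue_embeddings)
    n_dim = len(residue_embeddings[0])
    if n_res < keep_rows or n_dim < keep_cols:
        raise ValueError("matrix is smaller than the requested fingerprint")

    # Step 1: DCT along the sequence axis, one embedding dimension at a time.
    # along_seq[j][k] is frequency k of embedding dimension j.
    along_seq = [
        dct2([residue_embeddings[i][j] for i in range(n_res)])
        for j in range(n_dim)
    ]

    # Step 2: for each kept sequence frequency, DCT along the embedding axis
    # and keep only the first keep_cols coefficients.
    fingerprint = []
    for k in range(keep_rows):
        across_dims = dct2([along_seq[j][k] for j in range(n_dim)])
        fingerprint.extend(across_dims[:keep_cols])
    return fingerprint


# Toy "domains" of different lengths (10 and 25 residues, 6 dimensions each).
short_domain = [[math.sin(0.3 * i + j) for j in range(6)] for i in range(10)]
long_domain = [[math.sin(0.3 * i + j) for j in range(6)] for i in range(25)]

fp_short = domain_fingerprint(short_domain)
fp_long = domain_fingerprint(long_domain)
print(len(fp_short), len(fp_long))
```

Both fingerprints have the same length regardless of domain length, which is what makes direct vector comparison possible. Because the transform here is unnormalized, coefficient magnitudes grow with domain length, so a practical version would need some form of scaling before fingerprints are compared.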


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/d5a59e8fc423/1434f07.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/92f99b071d81/1434f01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/c09c88bb9343/1434f02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/87505de07183/1434f03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/2c70bfc9f8f0/1434f04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/a3e210c85fd1/1434f05.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dcd9/11529836/fbc617e8d1eb/1434f06.jpg


