Detecting remote evolutionary relationships among proteins by large-scale semantic embedding.

Affiliation

NEC Laboratories America, Princeton, New Jersey, United States of America.

Publication information

PLoS Comput Biol. 2011 Jan 27;7(1):e1001047. doi: 10.1371/journal.pcbi.1001047.

DOI:10.1371/journal.pcbi.1001047

PMID:21298082

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3029239/

Abstract

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods--i.e., measures of similarity between query and target sequences--provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
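The general idea of learning an embedding in which related items lie close together can be illustrated with a small sketch. The abstract does not specify ProtEmbed's features, objective, or optimizer, so everything below is an assumption made for illustration: the synthetic feature vectors, the linear map `W`, the triplet hinge loss, and the helper names `make_toy_data`, `train_embedding`, and `rank_targets` are hypothetical and are not taken from the paper.

```python
import numpy as np


def make_toy_data(n_per_family=20, d=6, seed=0):
    """Two synthetic 'families' of feature vectors (stand-ins for real protein features)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 0.2, size=(n_per_family, d))
    b = rng.normal(1.5, 0.2, size=(n_per_family, d))
    features = np.vstack([a, b])
    labels = np.array([0] * n_per_family + [1] * n_per_family)
    return features, labels


def train_embedding(features, labels, dim=2, margin=1.0, lr=0.01, steps=2000, seed=0):
    """Learn a linear map W (d x dim) with a triplet hinge loss.

    For a query q, a same-family item p, and a different-family item m,
    the loss is max(0, margin + ||(x_q - x_p) W||^2 - ||(x_q - x_m) W||^2).
    This is a generic ranking objective, assumed here for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.1, size=(d, dim))
    idx = np.arange(n)
    for _ in range(steps):
        q = rng.integers(n)
        pos = np.flatnonzero((labels == labels[q]) & (idx != q))
        neg = np.flatnonzero(labels != labels[q])
        if len(pos) == 0 or len(neg) == 0:
            continue  # no valid triplet for this query
        p, m = rng.choice(pos), rng.choice(neg)
        dp = features[q] - features[p]  # query minus related item
        dn = features[q] - features[m]  # query minus unrelated item
        ep, en = dp @ W, dn @ W         # the same differences in the embedding
        if margin + ep @ ep - en @ en > 0:
            # Gradient of the active hinge term with respect to W.
            grad = 2.0 * (np.outer(dp, ep) - np.outer(dn, en))
            W -= lr * grad
    return W


def rank_targets(W, features, query_idx):
    """Return all other items sorted by embedded distance to the query."""
    emb = features @ W
    dist = np.linalg.norm(emb - emb[query_idx], axis=1)
    order = np.argsort(dist)
    return order[order != query_idx]


if __name__ == "__main__":
    features, labels = make_toy_data()
    W = train_embedding(features, labels)
    print(labels[rank_targets(W, features, query_idx=0)][:5])
```

In this toy setting a database search reduces to sorting targets by distance to the query in the learned space. Because the map here projects to two dimensions, the embedded points could also be plotted directly, which loosely mirrors the visualization use mentioned in the abstract.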

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f5bc/3029239/41127f75708e/pcbi.1001047.g001.jpg

Similar articles

Protein ranking by semi-supervised network propagation.

BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2105-7-S1-S10.

Identifying remote protein homologs by network propagation.

FEBS J. 2005 Oct;272(20):5119-28. doi: 10.1111/j.1742-4658.2005.04947.x.

Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships.

J Comput Biol. 2003;10(6):857-68. doi: 10.1089/106652703322756113.

Protein homology detection by HMM-HMM comparison.

Bioinformatics. 2005 Apr 1;21(7):951-60. doi: 10.1093/bioinformatics/bti125. Epub 2004 Nov 5.

Homology detection via family pairwise search.

J Comput Biol. 1998 Fall;5(3):479-91. doi: 10.1089/cmb.1998.5.479.

Large-scale comparison of protein sequence alignment algorithms with structure alignments.

Proteins. 2000 Jul 1;40(1):6-22. doi: 10.1002/(sici)1097-0134(20000701)40:1<6::aid-prot30>3.0.co;2-7.

RANKPROP: a web server for protein remote homology detection.

Bioinformatics. 2009 Jan 1;25(1):121-2. doi: 10.1093/bioinformatics/btn567. Epub 2008 Nov 6.

SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.

Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.

Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

J Mol Biol. 1998 Dec 11;284(4):1201-10. doi: 10.1006/jmbi.1998.2221.

Cited by

Case Studies of Orphan Domain Reclassification in ECOD by Expert Curation.

Proteins. 2025 May 26. doi: 10.1002/prot.26840.

Major advances in protein function assignment by remote homolog detection with protein language models - A review.

Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27.

PRIP: A Protein-RNA Interface Predictor Based on Semantics of Sequences.

Life (Basel). 2022 Feb 18;12(2):307. doi: 10.3390/life12020307.

dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation.

Sci Rep. 2016 Sep 1;6:32333. doi: 10.1038/srep32333.

CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction.

Bioinformatics. 2016 Jun 15;32(12):i332-i340. doi: 10.1093/bioinformatics/btw271.

Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection.

Bioinformatics. 2014 Feb 15;30(4):472-9. doi: 10.1093/bioinformatics/btt709. Epub 2013 Dec 5.

Maps of protein structure space reveal a fundamental relationship between protein structure and function.

Proc Natl Acad Sci U S A. 2011 Jul 26;108(30):12301-6. doi: 10.1073/pnas.1102727108. Epub 2011 Jul 7.

References

A fast and automated solution for accurately resolving protein domain architectures.

Bioinformatics. 2010 Mar 15;26(6):745-51. doi: 10.1093/bioinformatics/btq034. Epub 2010 Jan 29.

Upcoming challenges for multiple sequence alignment methods in the high-throughput era.

Bioinformatics. 2009 Oct 1;25(19):2455-65. doi: 10.1093/bioinformatics/btp452. Epub 2009 Jul 30.

RANKPROP: a web server for protein remote homology detection.

Bioinformatics. 2009 Jan 1;25(1):121-2. doi: 10.1093/bioinformatics/btn567. Epub 2008 Nov 6.

The global trace graph, a novel paradigm for searching protein sequence databases.

Bioinformatics. 2007 Sep 15;23(18):2361-7. doi: 10.1093/bioinformatics/btm358. Epub 2007 Sep 6.

The HHpred interactive server for protein homology detection and structure prediction.

Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W244-8. doi: 10.1093/nar/gki408.

ADDA: a domain database with global coverage of the protein universe.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D188-91. doi: 10.1093/nar/gki096.

Protein ranking: from local to global structure in the protein similarity network.

Proc Natl Acad Sci U S A. 2004 Apr 27;101(17):6559-63. doi: 10.1073/pnas.0308067101. Epub 2004 Apr 15.

MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison.

Protein Sci. 2002 Nov;11(11):2606-21. doi: 10.1110/ps.0215902.

Using the Fisher kernel method to detect remote protein homologies.

Proc Int Conf Intell Syst Mol Biol. 1999:149-58.

Comparison of sequence profiles. Strategies for structural predictions using sequence information.

Protein Sci. 2000 Feb;9(2):232-41. doi: 10.1110/ps.9.2.232.
