

Detecting remote evolutionary relationships among proteins by large-scale semantic embedding.

Affiliation

NEC Laboratories America, Princeton, New Jersey, United States of America.

Publication information

PLoS Comput Biol. 2011 Jan 27;7(1):e1001047. doi: 10.1371/journal.pcbi.1001047.

Abstract

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
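
The idea that evolutionarily related proteins are embedded in close proximity can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-dimensional vectors, the protein IDs, the choice of Euclidean distance, and the margin-based ranking loss are hypothetical stand-ins, since the abstract does not state the actual ProtEmbed objective or distance measure.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_by_proximity(query_vec, database):
    """Return database protein IDs sorted from nearest to farthest from the query."""
    return sorted(database, key=lambda pid: euclidean(query_vec, database[pid]))

def margin_ranking_loss(query_vec, related_vec, unrelated_vec, margin=1.0):
    """Hinge-style loss (an assumed objective, not taken from the abstract).

    It is zero once the related protein is closer to the query than the
    unrelated protein by at least `margin`, and positive otherwise.
    """
    d_related = euclidean(query_vec, related_vec)
    d_unrelated = euclidean(query_vec, unrelated_vec)
    return max(0.0, margin + d_related - d_unrelated)

# Hypothetical, hand-picked embedding; a real one would be learned from data.
toy_embedding = {
    "homolog_a": [0.1, 0.2],
    "homolog_b": [0.0, 0.5],
    "unrelated_c": [3.0, -2.0],
}
query = [0.0, 0.0]

# Database search then reduces to sorting targets by distance to the query.
ranking = rank_by_proximity(query, toy_embedding)
print(ranking)
```

In a sketch like this, a learning procedure would adjust the vectors to drive such a loss toward zero over many (query, related, unrelated) triples, and extra evidence such as class labels could enter as additional loss terms.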


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f5bc/3029239/41127f75708e/pcbi.1001047.g001.jpg
