
Learning supervised embeddings for large scale sequence comparisons.

Affiliations

Department of ECE, Indraprastha Institute of Information Technology-Delhi, New Delhi, India.

School of Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia.

Publication information

PLoS One. 2020 Mar 13;15(3):e0216636. doi: 10.1371/journal.pone.0216636. eCollection 2020.

DOI:10.1371/journal.pone.0216636
PMID:32168338
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7069636/
Abstract

Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches, SuperVec and SuperVecX, to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose a hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX, according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods perform better for retrieval and classification tasks than existing (unsupervised) RepL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.
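
The hybrid "filter, then refine" idea in the abstract can be illustrated with a minimal sketch. This is not SuperVec or SuperVecX: the learned supervised embedding is replaced here by plain k-mer counts, and the "slower but more accurate" method is replaced by `difflib.SequenceMatcher` from the Python standard library as a stand-in for an alignment tool such as BLAST. The function names (`kmer_vector`, `shortlist`, `refine`), the toy sequences, and the parameters `k` and `top_n` are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers (a crude stand-in for a learned embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(query, database, top_n=2, k=3):
    """Fast filter: keep the top_n records closest to the query vector."""
    q = kmer_vector(query, k)
    scored = [(cosine(q, kmer_vector(seq, k)), name)
              for name, seq in database.items()]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_n]]


def slow_score(a, b):
    """Hypothetical 'accurate' scorer; a real pipeline would align here."""
    return SequenceMatcher(None, a, b).ratio()


def refine(query, database, candidates):
    """Run the slower scorer only on the shortlisted candidates."""
    return max(candidates, key=lambda name: slow_score(query, database[name]))


# Toy database (made-up sequences, for illustration only).
database = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQ",
    "seqB": "MKTAYIAKQRQISFVKAHFSRQ",
    "seqC": "GGSGGSGGSGGSGGSGGSGGSG",
    "seqD": "PLVEWWCCHHNNDDEEPLVEWW",
}

if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVKSHFSRQ"
    candidates = shortlist(query, database)
    print(candidates, refine(query, database, candidates))
```

The design point is only that the expensive scorer runs over the shortlist rather than over the whole collection; in the paper's setting the shortlist would come from the learned embeddings, not from raw k-mer counts.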


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/b4d9675acfaf/pone.0216636.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/c9b303a34d4f/pone.0216636.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/8117185755c6/pone.0216636.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/498b426a3053/pone.0216636.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/1c7bc40c96d7/pone.0216636.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/50e52abf9099/pone.0216636.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/1454c25bfabc/pone.0216636.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/545045a22d3f/pone.0216636.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/b84561291d7b/pone.0216636.g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/9d9ab01e641f/pone.0216636.g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/ec4d42a0725c/pone.0216636.g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/9b334bab0f7c/pone.0216636.g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/91e8bb869665/pone.0216636.g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74d9/7069636/42aa50edacb0/pone.0216636.g014.jpg
