Department of ECE, Indraprastha Institute of Information Technology-Delhi, New Delhi, India.
School of Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia.
PLoS One. 2020 Mar 13;15(3):e0216636. doi: 10.1371/journal.pone.0216636. eCollection 2020.
Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives that exploit the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches, SuperVec and SuperVecX, to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose a hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods outperform existing (unsupervised) RepL-based approaches on retrieval and classification tasks. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.
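The hybrid filter-then-refine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each sequence already has an embedding vector (however it was learned), uses cosine similarity as the fast filter, and takes a hypothetical `slow_score` callable as a stand-in for a slower, more accurate scorer such as an alignment-based tool. The function names, the `keep` parameter, and the toy data are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_then_refine(
    query_vec: Sequence[float],
    db_vecs: Dict[str, Sequence[float]],
    slow_score: Callable[[str], float],  # hypothetical stand-in for a slow, accurate scorer
    keep: int,
) -> List[Tuple[str, float]]:
    """Shortlist `keep` records by embedding similarity, then rerank them with `slow_score`."""
    # Stage 1 (fast): rank every record by cosine similarity to the query embedding.
    shortlist = sorted(
        db_vecs,
        key=lambda seq_id: cosine(query_vec, db_vecs[seq_id]),
        reverse=True,
    )[:keep]
    # Stage 2 (slow): run the expensive scorer only on the surviving records.
    scored = [(seq_id, slow_score(seq_id)) for seq_id in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration only): three records with 2-D embeddings.
toy_db = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_slow_scores = {"a": 10.0, "b": 20.0, "c": 99.0}
scored_ids: List[str] = []


def toy_slow_score(seq_id: str) -> float:
    scored_ids.append(seq_id)  # record which records the slow stage actually touched
    return toy_slow_scores[seq_id]


result = filter_then_refine([1.0, 0.0], toy_db, toy_slow_score, keep=2)
```

In this toy run the slow scorer is never called on record `c`, because the embedding filter discards it first; that is where the saving in a hybrid pipeline would come from, at the cost of missing any relevant record the filter ranks too low.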