Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA.
Department of Oral Biology, University at Buffalo, The State University of New York, Buffalo, NY, USA.
Bioinformatics. 2019 Jun 1;35(11):1820-1828. doi: 10.1093/bioinformatics/bty887.
Sequence analysis is arguably a foundation of modern biology. Classic approaches to sequence analysis are based on sequence alignment, which is limited when dealing with large-scale sequence data. A dozen alignment-free approaches have been developed to provide computationally efficient alternatives to alignment-based approaches. However, existing methods define sequence similarity based on various heuristics and can only provide rough approximations to alignment distances.
In this article, we developed a new approach, referred to as SENSE (SiamEse Neural network for Sequence Embedding), for efficient and accurate alignment-free sequence comparison. The basic idea is to use a deep neural network to learn an explicit embedding function from a small training dataset. The function projects sequences into an embedding space so that the mean square error between alignment distances and pairwise distances defined in the embedding space is minimized. To the best of our knowledge, this is the first attempt to use deep learning for alignment-free sequence analysis. A large-scale experiment demonstrated that our method significantly outperformed state-of-the-art alignment-free methods in terms of both efficiency and accuracy.
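The training objective described above can be illustrated with a toy sketch. This is not the SENSE implementation: a single linear layer stands in for the deep network, the one-hot encoding, the squared Euclidean embedding distance, the mismatch-fraction stand-in for alignment distance, and every name below (`one_hot`, `toy_distance`, `loss_and_grad`, `train`) are assumptions made for illustration only. The one property it is meant to show is the Siamese setup: both sequences of a pair pass through the same weights, and the loss is the mean square error between embedding distances and target distances.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a one-hot vector (illustrative encoding)."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    out = np.zeros((len(seq), len(ALPHABET)))
    for pos, ch in enumerate(seq):
        out[pos, index[ch]] = 1.0
    return out.ravel()

def toy_distance(a, b):
    """Stand-in for an alignment distance: fraction of mismatched positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def loss_and_grad(W, Xa, Xb, d_align):
    """MSE between embedding distances and target distances, with its gradient."""
    # Siamese step: the same weights W embed both members of every pair.
    z = Xa @ W - Xb @ W
    d_embed = np.sum(z ** 2, axis=1)          # squared Euclidean distance
    resid = d_embed - d_align
    loss = np.mean(resid ** 2)
    grad = (4.0 / len(resid)) * (Xa - Xb).T @ (resid[:, None] * z)
    return loss, grad

def train(W, Xa, Xb, d_align, steps=200):
    """Gradient descent with step halving; only loss-decreasing steps are kept."""
    loss, grad = loss_and_grad(W, Xa, Xb, d_align)
    for _ in range(steps):
        step = 1.0
        new_loss, new_grad = loss_and_grad(W - step * grad, Xa, Xb, d_align)
        while not (new_loss < loss) and step > 1e-10:
            step *= 0.5
            new_loss, new_grad = loss_and_grad(W - step * grad, Xa, Xb, d_align)
        if not (new_loss < loss):
            break                              # no improving step found
        W, loss, grad = W - step * grad, new_loss, new_grad
    return W, loss

# Hypothetical toy data: six equal-length sequences, all unordered pairs.
seqs = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TCGTTCGA", "TTGTTCGA", "GGGTTCAA"]
pairs = list(itertools.combinations(seqs, 2))
Xa = np.array([one_hot(a) for a, _ in pairs])
Xb = np.array([one_hot(b) for _, b in pairs])
d_align = np.array([toy_distance(a, b) for a, b in pairs])

rng = np.random.default_rng(0)
W0 = 0.1 * rng.standard_normal((Xa.shape[1], 4))   # 4-dimensional embedding
initial_loss, _ = loss_and_grad(W0, Xa, Xb, d_align)
W, final_loss = train(W0, Xa, Xb, d_align)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Once such a function is learned, comparing two new sequences needs only two embeddings and one vector distance rather than an alignment, which is where the efficiency of an embedding-based approach would come from.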
Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SENSE.html.
Supplementary data are available at Bioinformatics online.