Chen Jian, Yang Le, Li Lu, Goodison Steve, Sun Yijun
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA.
Department of Microbiology and Immunology, University at Buffalo, Buffalo, NY 14203, USA.
Bioinform Adv. 2022 Oct 21;2(1):vbac077. doi: 10.1093/bioadv/vbac077. eCollection 2022.
Quantifying pairwise sequence similarities is a key step in metagenomics studies. Alignment-free methods provide a computationally efficient alternative to alignment-based methods for large-scale sequence analysis. Several neural network-based methods have recently been developed for this purpose. However, existing methods do not perform well on sequences of varying lengths and are sensitive to the presence of insertions and deletions. In this article, we describe the development of a new method, referred to as AsMac, that addresses these issues. We proposed a novel neural network structure that performs approximate string matching to extract pertinent information from biological sequences, and developed an efficient gradient computation algorithm for training the constructed neural network. We performed a large-scale benchmark study using real-world data that demonstrated the effectiveness and potential utility of the proposed method.
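For readers unfamiliar with the term, approximate string matching can be illustrated with a classical dynamic-programming sketch. This is not the AsMac network or its training algorithm, which the abstract does not specify. It is only a minimal textbook example, and the function name `approx_match_distance` is hypothetical. It returns the smallest edit distance between a short pattern and any substring of a longer sequence, so insertions, deletions and length differences are tolerated.

```python
def approx_match_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`.

    Illustrative only: a classical dynamic program, not the AsMac model.
    """
    # prev[j]: best cost of matching the pattern prefix seen so far against
    # some substring of `text` that ends at position j. The empty pattern
    # matches anywhere at zero cost, which lets a match start at any position.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        # Matching i pattern characters against an empty substring costs i.
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (p != t),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,             # pattern character left unmatched
                curr[j - 1] + 1,         # extra character in the text
            )
        prev = curr
    # The match may end at any position in the text.
    return min(prev)


if __name__ == "__main__":
    print(approx_match_distance("ACGT", "TTACGTTT"))  # exact occurrence
    print(approx_match_distance("ACGT", "TTACTTT"))   # occurrence with one deletion
```

Because the first row is all zeros and the minimum is taken over the last row, the pattern can align to any region of the longer sequence, which is what distinguishes approximate matching from a global alignment of the two full strings.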
The open-source software for the proposed method, together with trained neural-network models for some commonly used metagenomics marker genes, is freely available at www.acsu.buffalo.edu/~yijunsun/lab/AsMac.html.
Supplementary data are available online.