College of Communication Engineering, Jilin University, Changchun 130022, China.
Bioinformatics. 2022 Feb 7;38(5):1216-1222. doi: 10.1093/bioinformatics/btab845.
Viruses, the most abundant biological entities on earth, are important components of microbial communities and, as major human pathogens, are responsible for human mortality and morbidity. Identifying viral sequences in metagenomes is critical for viral analysis. Next-generation sequencing generates massive quantities of short sequences, and most methods encode these nucleotide sequences as discrete, sparse one-hot vectors, which are usually ineffective for viral identification.
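For readers unfamiliar with the encoding mentioned above, the following is a minimal sketch of one-hot encoding for a nucleotide sequence. The function name `one_hot_encode` and the choice to map unknown bases to an all-zero row are illustrative assumptions, not taken from any of the cited tools.

```python
# Illustrative sketch only: one-hot encoding of a nucleotide sequence.
# The name `one_hot_encode` and the all-zero handling of unknown bases
# (such as N) are assumptions made for this example.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base; unknown bases become all zeros."""
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Each row holds at most a single 1, and every pair of distinct bases is equally far apart, so the encoding itself carries no information about how sequence elements relate to one another.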
In this article, Virtifier, a deep learning-based viral identifier for sequences from metagenomic data, is proposed. It includes a meaningful nucleotide sequence encoding method named Seq2Vec and a variant viral sequence predictor with an attention-based long short-term memory (LSTM) network. By using a fully trained embedding matrix to encode codons, Seq2Vec can efficiently extract the relationships among the codons in a nucleotide sequence. Combined with an attention layer, the LSTM network can further analyze these codon relationships and select the parts that contribute to the final features. Experimental results on three datasets show that Virtifier accurately identifies short viral sequences (<500 bp) from metagenomes, surpassing three widely used methods: VirFinder, DeepVirFinder and PPR-Meta. Virtifier also achieves comparable performance on longer sequences (>5000 bp).
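The general idea of encoding codons through an embedding matrix and weighting positions with attention can be sketched as follows. This is not the Virtifier implementation: the embedding matrix here is random rather than trained, the LSTM is omitted so that attention is applied directly to the embeddings, and the names `tokenize_codons`, `embed`, `attention_pool`, the stride of 1, and `EMBED_DIM = 8` are all hypothetical choices made for illustration.

```python
import math
import random
from itertools import product

# All 64 codons over ACGT mapped to ids 1..64; id 0 is reserved for
# codons containing unknown bases (an assumption of this sketch).
CODON_TO_ID = {"".join(c): i + 1 for i, c in enumerate(product("ACGT", repeat=3))}

EMBED_DIM = 8  # hypothetical embedding size, not taken from the paper
_rng = random.Random(0)
# Random stand-in for a trained embedding matrix: one row per token id.
EMBEDDING = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(len(CODON_TO_ID) + 1)
]


def tokenize_codons(sequence, stride=1):
    """Split a sequence into 3-mers and map each to an integer id.

    The stride (1 = overlapping codons) is an assumption of this sketch.
    """
    seq = sequence.upper()
    return [CODON_TO_ID.get(seq[i:i + 3], 0) for i in range(0, len(seq) - 2, stride)]


def embed(token_ids):
    """Look up one dense embedding row per codon id."""
    return [EMBEDDING[t] for t in token_ids]


def attention_pool(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with query.

    Expects a non-empty list of vectors with the same length as query.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]
    return pooled, weights


if __name__ == "__main__":
    tokens = tokenize_codons("ATGGCGTAA")
    pooled, weights = attention_pool(embed(tokens), [1.0] * EMBED_DIM)
    print(tokens)
    print([round(w, 3) for w in weights])
```

In a full model the query vector and the embedding matrix would be learned, and the attention weights would be computed over LSTM hidden states rather than over raw embeddings.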
A Python implementation of Virtifier and the Python code developed for this study are available on GitHub at https://github.com/crazyinter/Seq2Vec. The RefSeq genomes used in this article are available from VirFinder at https://dx.doi.org/10.1186/s40168-017-0283-5. The CAMI Challenge Dataset 3 CAMI_high dataset used in this article is available from CAMI at https://data.cami-challenge.org/participate. The real human gut metagenomes used in this article are available at https://dx.doi.org/10.1101/gr.142315.112.
Supplementary data are available at Bioinformatics online.