IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):763-774. doi: 10.1109/TCBB.2022.3161135. Epub 2023 Feb 3.
Metagenome sequencing provides an unprecedented opportunity to discover unknown microbes and viruses. Metagenomes contain large numbers of phages and prokaryotes mixed together. To study the influence of phages on the human body and on environments, it is important to isolate phages from metagenomes. However, novel phages are difficult to identify because their sequences are diverse and metagenomes frequently contain short contigs. Here, virSearcher is developed to identify phages from metagenomes by combining a convolutional neural network (CNN) with the gene information of input sequences. First, an input sequence is encoded according to the different functions of its coding and non-coding regions; a word embedding layer then converts this encoding into word embeddings that are passed to a convolutional layer. In addition, the hit ratio of virus genes is combined with the output of the CNN to further improve the performance of the network. The genes used by virSearcher include both complete and incomplete genes. Experiments on several metagenomes have shown that, compared with other methods, virSearcher significantly improves identification performance on short sequences while maintaining performance on long ones. The source code of virSearcher is freely available from http://github.com/DrJackson18/virSearcher.
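The two ideas in the abstract, region-aware encoding and the gene hit ratio, can be illustrated with a minimal sketch. This is not the virSearcher implementation: the token vocabulary, the function names (`encode_sequence`, `gene_hit_ratio`, `combine_scores`), and the weighted-average combination rule are all assumptions made for illustration, since the abstract does not specify them. The token ids produced here are the kind of input a word embedding layer would consume before the convolutional layer.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary (an assumption, not taken from virSearcher):
# bases inside coding regions map to ids 1-4, bases in non-coding
# regions map to ids 5-8, and 0 is reserved for padding/ambiguous bases.
BASES = "ACGT"
CODING_OFFSET = 1
NONCODING_OFFSET = 5
PAD = 0


def encode_sequence(seq: str, coding_regions: Sequence[Tuple[int, int]]) -> List[int]:
    """Return one token id per base, distinguishing coding from non-coding positions.

    coding_regions holds 0-based, half-open (start, end) intervals.
    """
    is_coding = [False] * len(seq)
    for start, end in coding_regions:
        for i in range(max(0, start), min(len(seq), end)):
            is_coding[i] = True

    tokens = []
    for base, coding in zip(seq.upper(), is_coding):
        idx = BASES.find(base)
        if idx < 0:
            tokens.append(PAD)  # ambiguous base such as N
        else:
            tokens.append(idx + (CODING_OFFSET if coding else NONCODING_OFFSET))
    return tokens


def gene_hit_ratio(n_virus_hits: int, n_genes: int) -> float:
    """Fraction of predicted genes that hit a virus gene; 0.0 when no genes exist."""
    return n_virus_hits / n_genes if n_genes else 0.0


def combine_scores(cnn_score: float, hit_ratio: float, weight: float = 0.5) -> float:
    """Blend the CNN output with the hit ratio.

    A weighted average is an assumed placeholder; the abstract only says
    the two quantities are combined.
    """
    return (1.0 - weight) * cnn_score + weight * hit_ratio


if __name__ == "__main__":
    # Toy contig: a 9-base coding region followed by 2 non-coding bases.
    tokens = encode_sequence("ATGAAATAGCC", [(0, 9)])
    print(tokens)
    print(combine_scores(cnn_score=0.6, hit_ratio=gene_hit_ratio(3, 4)))
```

Giving coding and non-coding bases separate token ids lets an embedding layer learn different vectors for the same nucleotide depending on its region, which is one plausible reading of encoding "according to the different functions" of the two region types.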