Al-Najim Abdullatif, Hauns Sven, Tran Van Dinh, Backofen Rolf, Alkhnbashi Omer S
Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran 34462, Saudi Arabia.
Bioinformatics group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 101, 79110 Freiburg, Germany.
Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf037.
Bacteriophages are among the most abundant organisms on Earth, significantly impacting ecosystems and human society. Identifying viral sequences, especially novel ones, in mixed metagenomes is a critical first step in analyzing the viral components of host samples, and it underpins many downstream tasks. The task is challenging, however, because viruses evolve rapidly. Identification typically involves two steps: distinguishing viral sequences from host sequences and determining whether they come from novel viral genomes. Traditional metagenomic techniques that rely on sequence similarity to known entities often fall short, especially when dealing with short or novel genomes. Meanwhile, deep learning has demonstrated its efficacy across various domains, including bioinformatics.
We have developed HVSeeker, a host/virus seeker method based on deep learning, to distinguish between bacterial and phage sequences. HVSeeker consists of two separate models: one analyzing DNA sequences and the other focusing on proteins. In addition to the robust architecture of HVSeeker, three distinct preprocessing methods were introduced to enhance the learning process: padding, contig assembly, and sliding window. The method has shown promising results on sequences of various lengths, ranging from 200 to 1,500 base pairs. Tested on both the NCBI and IMGVR databases, HVSeeker outperformed several methods from the literature, such as Seeker, Rnn-VirSeeker, DeepVirFinder, and PPR-Meta. Moreover, when compared with other methods on benchmark datasets, HVSeeker showed better performance, establishing its effectiveness in identifying unknown phage genomes.
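Two of the preprocessing ideas named above, padding and sliding window, can be illustrated with a minimal Python sketch. This is not the HVSeeker implementation: the function names, the `N` pad character, right-side padding, the window and stride values, and the one-hot encoding are all assumptions chosen for illustration, and contig assembly is omitted because the abstract does not describe how it is performed.

```python
from typing import List


def sliding_windows(seq: str, window: int, stride: int) -> List[str]:
    """Split a sequence into overlapping fragments of `window` bases.

    Hypothetical helper: window and stride values are illustrative only.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        return [seq]
    starts = list(range(0, len(seq) - window + 1, stride))
    # Add one final window so the tail of the sequence is not dropped.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [seq[s:s + window] for s in starts]


def pad_sequence(seq: str, length: int, pad_char: str = "N") -> str:
    """Right-pad with `pad_char` (or truncate) to exactly `length` bases.

    Assumption: 'N' padding on the right; the paper may pad differently.
    """
    return seq[:length].ljust(length, pad_char)


def one_hot(seq: str, alphabet: str = "ACGT") -> List[List[int]]:
    """Encode each base as a one-hot row; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    fragments = sliding_windows("ACGTACGTACG", window=4, stride=3)
    fixed = [pad_sequence(f, 6) for f in fragments]
    print(fragments)
    print(fixed)
    print(one_hot(fixed[0]))
```

Under these assumptions, padding brings short fragments up to a fixed model input length, while the sliding window breaks long sequences into overlapping pieces of that length so that no region is discarded.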
These results demonstrate the exceptional structure of HVSeeker, which encompasses both the preprocessing methods and the model design. The advancements provided by HVSeeker are significant for identifying viral genomes and developing new therapeutic approaches, such as phage therapy. Therefore, HVSeeker serves as an essential tool in prokaryotic and phage taxonomy, offering a crucial first step toward analyzing the host-viral component of samples by identifying the host and viral sequences in mixed metagenomes.