Peng Cheng, Shang Jiayu, Guan Jiaojiao, Wang Donglin, Sun Yanni
Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China.
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong (SAR), China.
Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae704.
Viruses are ubiquitous and highly diverse, and they play pivotal roles in ecological systems and public health. Accurately identifying viruses across ecosystems is essential for understanding their diversity and assessing their ecological impact. Metagenomic sequencing has become a major strategy for surveying viruses in these ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising for novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency among the virus search results produced by available tools further highlights the urgent need for a more robust virus identification tool.
In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By using the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.
The source code of ViraLM is available via: https://github.com/ChengPENG-wolf/ViraLM.
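For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score is computed for a binary viral/non-viral contig classifier. It is not taken from the ViraLM code base: the function name `f1_score` and the toy labels are illustrative assumptions, and the abstract does not state whether the 22% improvement is an absolute or a relative difference.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Return the F1-score of the `positive` class (here: 1 = viral contig).

    Hypothetical helper for illustration only; not part of ViraLM.
    """
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: viral contigs predicted as viral.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: non-viral contigs predicted as viral.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: viral contigs the classifier missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for this sketch): 1 = viral, 0 = non-viral.
truth = [1, 1, 1, 0, 0, 0]
calls = [1, 1, 0, 1, 0, 0]
score = f1_score(truth, calls)
```

Because F1 penalizes both missed viral contigs (low recall) and false viral calls (low precision), it is a natural summary for short contigs, where the abstract notes that existing tools tend to miss viral sequences.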