ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

Affiliations

Computational Neuroscience Lab, Institute of Computer Science, University of Tartu, Tartu, Estonia.

Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden.

Publication information

PLoS One. 2019 Sep 11;14(9):e0222271. doi: 10.1371/journal.pone.0222271. eCollection 2019.

DOI:10.1371/journal.pone.0222271

PMID:31509583

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6738585/

Abstract

Despite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as "unknown" since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern-frequencies on raw metagenomics contigs. The training dataset included sequences obtained from 19 metagenomic experiments which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs ViraMiner achieves 0.923 area under the ROC curve. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information of genome composition, and can be used as a recommendation system to further investigate sequences labeled as "unknown" by conventional alignment methods. Exploring these highly-divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.
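
The abstract distinguishes between detecting patterns and detecting pattern frequencies on raw contigs. The following is a minimal sketch of that distinction, not the ViraMiner implementation. It assumes a one-hot encoding of the DNA sequence, a single hand-made convolution filter, and global max pooling versus global average pooling as the two read-outs; the abstract does not specify any of these. All function names (`one_hot`, `scan`, `pattern_feature`, `frequency_feature`) are hypothetical.

```python
# Illustrative sketch only; the encoding, filter, and pooling choices are
# assumptions and are not taken from the abstract.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(seq, motif):
    """Slide one filter along the sequence and apply ReLU.

    The filter is the one-hot motif with a negative bias, so the activation
    is 1.0 where the motif matches exactly and 0.0 everywhere else.
    """
    encoded = one_hot(seq)
    kernel = one_hot(motif)
    width = len(kernel)
    bias = -(width - 1)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = bias + sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(score, 0.0))  # ReLU
    return activations

def pattern_feature(activations):
    """Global max pooling: is the pattern present anywhere in the contig?"""
    return max(activations)

def frequency_feature(activations):
    """Global average pooling: how often does the pattern occur per position?"""
    return sum(activations) / len(activations)

if __name__ == "__main__":
    for contig in ("ACGTACGTAA", "ACGTTTTTTT"):
        acts = scan(contig, "ACG")
        print(contig, pattern_feature(acts), frequency_feature(acts))
```

In this toy setting both contigs yield the same max-pooled value because the motif occurs at least once in each, while the average-pooled value differs with the number of occurrences. A trained network would learn many filters rather than use a fixed motif.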
