

ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

Affiliations

Computational Neuroscience Lab, Institute of Computer Science, University of Tartu, Tartu, Estonia.

Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden.

Publication information

PLoS One. 2019 Sep 11;14(9):e0222271. doi: 10.1371/journal.pone.0222271. eCollection 2019.

Abstract

Despite its clinical importance, detection of highly divergent or as yet unknown viruses remains a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as "unknown" because many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern frequencies in raw metagenomic contigs. The training dataset included sequences obtained from 19 metagenomic experiments, which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs, ViraMiner achieves an area under the ROC curve of 0.923. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information about genome composition and can be used as a recommendation system to further investigate sequences labeled as "unknown" by conventional alignment methods. Exploring these highly divergent viruses can, in turn, enhance our knowledge of infectious causes of disease.
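The distinction between detecting a pattern and detecting a pattern frequency can be illustrated with a toy example. The abstract does not give architectural details, so the sketch below rests on an assumption: that one branch summarizes each convolutional filter with global max pooling (is the pattern present anywhere?) and the other with global average pooling (how often does it occur?). The filter here is hand-made and hypothetical, not a learned one, and the helper names (`one_hot`, `conv1d_relu`, `branch_features`) are illustrative only.

```python
from typing import List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Any other symbol (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_relu(encoded: List[List[float]],
                kernel: List[List[float]],
                bias: float) -> List[float]:
    """Slide one filter over the sequence (no padding) and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = bias
        for i in range(width):
            for j in range(4):
                score += encoded[start + i][j] * kernel[i][j]
        activations.append(max(0.0, score))
    return activations


def branch_features(seq: str, motif: str) -> Tuple[float, float]:
    """Return (global max pool, global average pool) for one motif filter.

    Assumption: the filter is the one-hot motif itself, with a bias chosen
    so that only an exact match yields a positive activation of 1.0.
    """
    kernel = one_hot(motif)
    bias = -(len(motif) - 1)
    activations = conv1d_relu(one_hot(seq), kernel, bias)
    pattern = max(activations)                        # presence of the motif
    frequency = sum(activations) / len(activations)   # how often it occurs
    return pattern, frequency


# Two toy contigs of equal length: the motif ACG occurs three times in the
# first and once in the second.
many = "ACGTACGTTTACG"
once = "ACG" + "T" * 10

print(branch_features(many, "ACG"))
print(branch_features(once, "ACG"))
```

Both toy contigs give the same max-pooled value, because the motif is present in each, while the average-pooled value is three times larger for the first contig. That is the sense in which two differently pooled branches could capture complementary information about sequence composition.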

