Virus detection in high-throughput sequencing data without a reference genome of the host.

Author affiliations

Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, Hannover 30559, Germany.

Research Center for Emerging Infections and Zoonoses, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17, Hannover 30559, Germany.

Publication information

Infect Genet Evol. 2018 Dec;66:180-187. doi: 10.1016/j.meegid.2018.09.026. Epub 2018 Oct 3.

Abstract

Discovery of novel viruses in host samples is a multidisciplinary process that relies increasingly on next-generation sequencing (NGS) followed by computational analysis. A crucial step in this analysis is to separate host sequence reads from the sequence reads of the virus to be discovered. This becomes especially difficult if no reference genome of the host is available. Furthermore, if the total number of viral reads in a sample is low, de novo assembly of the virus, which most existing pipelines require, is hard to achieve. We present a new modular computational pipeline for the discovery of novel viruses in host samples. While existing pipelines rely on the availability of the host's reference genome for filtering sequence reads, our new pipeline can also cope with cases in which no reference genome is available. As a further novelty of our method, a decoy module is used to assess false classification rates in the discovery process. Additionally, viruses with low read coverage can be identified and visually reviewed. We validate our pipeline on simulated data as well as on two experimental samples with known virus content. For the experimental samples, we were able to reproduce the laboratory findings. Our newly developed pipeline is applicable to virus detection in a wide range of host species. The three modules we present can either be incorporated individually into other pipelines or be used as a stand-alone pipeline. We are the first to present a decoy approach within a virus detection pipeline that can be used to assess error rates, so that the quality of the final result can be judged. We provide an implementation of our modules via GitHub. However, the principle of the modules can easily be re-implemented by other researchers.
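
The abstract does not spell out how the decoy module works, so the following is only a minimal sketch of the general target-decoy idea, not the authors' implementation. It assumes that decoys are built by shuffling viral reference sequences and that the false classification rate is estimated as decoy hits divided by target hits; the function names `make_decoy` and `estimate_false_rate`, the example sequence, and the hit counts are all hypothetical.

```python
import random


def make_decoy(sequence: str, rng: random.Random) -> str:
    """Build a decoy by shuffling the nucleotides of a reference sequence.

    Assumption: shuffling keeps length and base composition but destroys
    the real order, so any read classified to the decoy is a false hit.
    """
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def estimate_false_rate(target_hits: int, decoy_hits: int) -> float:
    """Estimate the false classification rate as decoy hits per target hit.

    Assumption: false hits land on targets and decoys equally often,
    so the decoy count approximates the number of false target hits.
    """
    if target_hits <= 0:
        raise ValueError("no target hits, so the rate is undefined")
    return decoy_hits / target_hits


rng = random.Random(0)  # fixed seed so the decoy is reproducible
viral_ref = "ATGGCGTACGTTAGCCGATAGGCTA"  # made-up reference fragment
decoy_ref = make_decoy(viral_ref, rng)

# Hypothetical counts from classifying reads against targets plus decoys.
rate = estimate_false_rate(target_hits=200, decoy_hits=4)
print(f"decoy: {decoy_ref}")
print(f"estimated false classification rate: {rate:.3f}")
```

In such a setup, reads would be classified against a database containing both the real viral references and their decoys, and a high decoy hit count would signal that the reported viral hits deserve closer review.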
