Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany.
Institute for Virology and Immunobiology, University of Würzburg, Versbacher Straße 7, 97078 Würzburg, Germany.
Comput Biol Chem. 2021 Oct;94:107555. doi: 10.1016/j.compbiolchem.2021.107555. Epub 2021 Aug 2.
Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step in most virus detection pipelines is mapping sequence reads against known virus genomes. Because the sequences of related viruses differ only slightly, and because of various biological and technical errors, mapping is subject to uncertainty. As a consequence, the resulting list of detected viruses can lack robustness. A new approach is proposed that generates artificial sequencing reads by resampling from the original findings and helps to assess the robustness of the originally identified list of viruses. From the original mapping result, in the form of a SAM file, a set of statistical distributions is derived. These are used in the resampling pipeline to generate new artificial reads, which are again mapped against the reference genomes. By summarizing the resampling procedure, the analyst learns whether the presence of a particular virus in the sample gains or loses evidence, and thus how robust the original mapping list and the individual viruses in it are. To judge robustness, several indicators are derived from the resampling procedure, such as the correlation between original and resampled read counts, or the statistical detection of outliers in the differences of read counts. Additionally, graphical illustrations of read count shifts via Sankey diagrams are provided. To demonstrate its use, the resampling approach is applied to three real-world data samples, one of them with laboratory-confirmed influenza sequences, and to artificially generated data in which virus sequences were spiked into the sequencing data of a host. When the resampling pipeline is applied, several viruses drop from the original list while new viruses emerge, indicating robustness of those viruses that remain in the list.
The evaluation shows that the resampling approach helps to analyze the viral content of a biological sample, to rate the robustness of the original findings, and to better show the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.
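As an illustration of the robustness indicators named in the abstract, the following minimal Python sketch counts mapped reads per reference genome from SAM text and compares original with resampled counts. It is not the authors' implementation. The function names are hypothetical, the Pearson coefficient is assumed as the correlation measure, and the interquartile-range rule is assumed as the outlier test, since the abstract specifies neither. Secondary and supplementary alignments are skipped so that each read is counted once, which is also an assumption.

```python
import math
from collections import Counter
from statistics import quantiles


def count_reads_per_reference(sam_lines):
    """Count mapped reads per reference name (RNAME, SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # blank or header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary alignment
        if flag & (0x4 | 0x100 | 0x800) or rname == "*":
            continue
        counts[rname] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def robustness_indicators(original, resampled, k=1.5):
    """Compare per-virus read counts from the original and resampled mappings.

    Viruses missing from one mapping are given a count of 0 there.
    Outliers are read count differences outside [Q1 - k*IQR, Q3 + k*IQR]
    (assumed rule; needs at least two viruses).
    """
    viruses = sorted(set(original) | set(resampled))
    orig = [original.get(v, 0) for v in viruses]
    res = [resampled.get(v, 0) for v in viruses]
    diffs = {v: r - o for v, o, r in zip(viruses, orig, res)}
    q1, _, q3 = quantiles(list(diffs.values()), n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    outliers = [v for v in viruses if diffs[v] < low or diffs[v] > high]
    return {
        "correlation": pearson(orig, res),
        "differences": diffs,
        "outliers": outliers,
    }
```

A virus flagged as an outlier with a strongly negative difference loses evidence under resampling, while a high correlation across all viruses suggests that the original list is stable as a whole.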