Measuring reproducibility of virus metagenomics analyses using bootstrap samples from FASTQ-files.

Affiliations

Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover D-30559, Germany.

Institute for Terrestrial and Aquatic Wildlife Research, University of Veterinary Medicine Hannover, Hannover D-30559, Germany.

Publication information

Bioinformatics. 2021 May 23;37(8):1068-1075. doi: 10.1093/bioinformatics/btaa926.

DOI:10.1093/bioinformatics/btaa926

PMID:33135067

Abstract

MOTIVATION

High-throughput sequencing data can be affected by different technical errors, e.g. from probe preparation or false base calling. As a consequence, reproducibility of experiments can be weakened. In virus metagenomics, technical errors can result in falsely identified viruses in samples from infected hosts. We present a new resampling approach based on bootstrap sampling of sequencing reads from FASTQ-files in order to generate artificial replicates of sequencing runs which can help to judge the robustness of an analysis. In addition, we evaluate a mixture model on the distribution of read counts per virus to identify potentially false positive findings.
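The resampling idea can be illustrated with a minimal sketch. It assumes that one artificial replicate is obtained by drawing as many reads as the original FASTQ file contains, uniformly at random and with replacement. The function names (`parse_fastq`, `bootstrap_reads`, `write_fastq`) are hypothetical and are not taken from the RESEQ tool, whose actual interface and sampling details may differ.

```python
import random
from typing import Iterable, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
Read = Tuple[str, ...]


def parse_fastq(lines: Iterable[str]) -> List[Read]:
    """Group FASTQ text into 4-line records (assumes unwrapped records)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    return [tuple(cleaned[i:i + 4]) for i in range(0, len(cleaned), 4)]


def bootstrap_reads(reads: List[Read], rng: random.Random) -> List[Read]:
    """Draw len(reads) reads uniformly at random, with replacement."""
    return rng.choices(reads, k=len(reads))


def write_fastq(reads: Iterable[Read]) -> str:
    """Serialise records back to FASTQ text."""
    return "".join("\n".join(read) + "\n" for read in reads)


# Toy input invented for illustration only.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n"
reads = parse_fastq(example.splitlines())
replicate = bootstrap_reads(reads, random.Random(42))  # seeded for repeatability
```

Each call with a different seed yields another artificial replicate. Running the same virus detection pipeline on every replicate gives one read count per virus per replicate, which can then be compared with the count from the original file.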

RESULTS

The evaluation of our approach on an artificially generated dataset with known viral sequence content shows generally high reproducibility in uncovering viruses in sequencing data, i.e. the original and mean bootstrap read counts were highly correlated. However, the bootstrap read counts can also indicate reduced or increased evidence for the presence of a virus in the biological sample. We also found that the mixture model fits the read counts well and, furthermore, that it provides higher accuracy on the original or on the bootstrap read counts than on the difference between the two. The usefulness of our methods is further demonstrated on two freely available real-world datasets from harbor seals.
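The two quantities discussed above can be sketched as follows. The abstract does not state which distributions the mixture model uses, so the two-component Poisson mixture fitted by expectation-maximisation below is an assumption chosen for illustration, not the authors' model. All function names and the example counts are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, e.g. original vs. mean bootstrap read counts."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def poisson_logpmf(k: int, lam: float) -> float:
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_poisson_mixture(
    counts: Sequence[int], n_iter: int = 200
) -> Tuple[float, float, float, List[float]]:
    """EM for a two-component Poisson mixture (assumed model).

    Returns (weight of high component, low mean, high mean,
    per-virus responsibility of the high component from the last E-step).
    """
    lam_low = max(min(counts), 0.5)
    lam_high = max(max(counts), lam_low + 1.0)
    w_high = 0.5
    resp: List[float] = []
    for _ in range(n_iter):
        # E-step: posterior probability that each count is "high".
        resp = []
        for k in counts:
            log_low = math.log(1.0 - w_high) + poisson_logpmf(k, lam_low)
            log_high = math.log(w_high) + poisson_logpmf(k, lam_high)
            m = max(log_low, log_high)  # subtract max for numerical stability
            e_low, e_high = math.exp(log_low - m), math.exp(log_high - m)
            resp.append(e_high / (e_low + e_high))
        # M-step: update weight and means, clamped away from 0 and 1.
        n_high = sum(resp)
        n_low = len(counts) - n_high
        w_high = min(max(n_high / len(counts), 1e-6), 1.0 - 1e-6)
        lam_high = max(
            sum(r * k for r, k in zip(resp, counts)) / max(n_high, 1e-12), 1e-6
        )
        lam_low = max(
            sum((1.0 - r) * k for r, k in zip(resp, counts)) / max(n_low, 1e-12), 1e-6
        )
    return w_high, lam_low, lam_high, resp


# Hypothetical read counts per virus, invented for illustration only.
counts = [1, 0, 2, 1, 3, 250, 310, 280, 2, 0]
w_high, lam_low, lam_high, resp = fit_poisson_mixture(counts)
likely_present = [r > 0.5 for r in resp]
```

Under this assumed model, viruses assigned to the low-count component would be candidates for false positive findings. The same fit can be run on the original counts, on the mean bootstrap counts, or on their difference to compare how well each separates the two groups.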

AVAILABILITY AND IMPLEMENTATION

We provide a Python tool, called RESEQ, available from https://github.com/babaksaremi/RESEQ, that allows efficient generation of bootstrap reads from an original FASTQ file.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
