Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA.
Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; Department of Computer Science, University of Vermont, Burlington, VT 05405, USA; Neuroscience, Behavior, Health Initiative, University of Vermont, Burlington, VT 05405, USA.
Genomics. 2021 Jan;113(1 Pt 2):1189-1198. doi: 10.1016/j.ygeno.2020.12.004. Epub 2020 Dec 7.
Numerous viral sequences have been reported in the whole-genome sequencing (WGS) data of human blood. However, it is not clear to what degree the virus-mappable reads represent true viral sequences rather than random mapping or noise originating from sample preparation, sequencing processes, or other sources. Identification of patterns of virus-mappable reads may generate novel indicators for evaluating the origins of these viral sequences. We characterized paired-end unmapped reads and reads aligned to viral references in human WGS datasets, then compared patterns of the virus-mappable reads among the DNA sources and sequencing facilities that produced these datasets. We then examined potential origins of the source- and facility-associated viral reads. The proportions of clean unmapped reads among the seven sequencing facilities were significantly different (P < 2 × 10). We identified 260,339 reads that were mappable to a total of 99 viral references in 2535 samples. The majority (86.7%) of these virus-mappable reads (corresponding to 47 viral references), which can be classified into four groups based on their distinct patterns, were strongly associated with sequencing facility or DNA source (adjusted P value < 0.01). Possible origins of these reads include artificial sequences in library preparation, recombinant vectors in cell culture, and phages co-contaminated with their host bacteria. The sequencing facility-associated virus-mappable reads and patterns were repeatedly observed in other datasets produced in the same facilities. We have constructed an analytic framework and profiled the unmapped reads mappable to viral references. The results provide a new understanding of sequencing facility- and DNA source-associated batch effects in deep sequencing data and may facilitate improved bioinformatics filtering of reads.
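The abstract reports associations between virus-mappable reads and sequencing facility at an adjusted P value threshold, but it does not state which test or which multiple-testing correction was used. The following is only a minimal sketch of one way such a screen could be set up, assuming a one-sided Fisher's exact test on per-virus presence/absence counts and a Benjamini-Hochberg adjustment. The function names, the table layout, and the toy counts are all hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, not from the paper):
      a = samples from the facility of interest with reads for a virus
      b = samples from that facility without such reads
      c = samples from all other facilities with such reads
      d = samples from all other facilities without such reads
    Returns P(X >= a) under the hypergeometric null.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts per viral reference (hypothetical, for illustration only).
toy_tables = {
    "virus_A": (30, 70, 2, 398),
    "virus_B": (5, 95, 20, 380),
}
raw_p = [fisher_exact_greater(*table) for table in toy_tables.values()]
adj_p = benjamini_hochberg(raw_p)
flagged = [name for name, p in zip(toy_tables, adj_p) if p < 0.01]
```

With real data, each viral reference would contribute one table per facility or DNA source, and the adjustment would be applied across all tests together.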