Department Computational Biology of Infection Research of the Helmholtz Centre for Infection Research.
Institute of Virology in Hannover Medical School.
Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa123.
Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires accurate assembly of individual strains or sequence variants, as well as suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low-divergence viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low-abundance strains, and it did so in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. The two shared a large fraction of their false-positive variant calls, which were strongly enriched in T-to-G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow, named QuasiModo (Quasispecies Metric determination on omics), under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
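The context-dependent error described above can be checked for in one's own call sets by tallying each single-nucleotide call by substitution type and flanking bases. The following is a minimal sketch, not the QuasiModo implementation: the function name `context_counts`, the `(position, ref, alt)` tuple layout and the toy sequence are all hypothetical, and it assumes that 'G.G' denotes a variant site flanked by G on both sides.

```python
from collections import Counter


def context_counts(reference, variants):
    """Tally SNV calls by substitution type and flanking-base context.

    reference: reference sequence as a string
    variants:  iterable of (pos, ref, alt) tuples, with 0-based positions
               (hypothetical layout, chosen for this sketch)
    Returns a Counter keyed by ("REF>ALT", "5'base.3'base").
    """
    counts = Counter()
    for pos, ref, alt in variants:
        # Calls at either end of the sequence have no flanking base.
        if pos <= 0 or pos >= len(reference) - 1:
            continue
        # Ignore calls whose ref allele disagrees with the reference.
        if reference[pos].upper() != ref.upper():
            continue
        context = f"{reference[pos - 1]}.{reference[pos + 1]}".upper()
        counts[(f"{ref.upper()}>{alt.upper()}", context)] += 1
    return counts


# Toy data for illustration only; not taken from the study.
toy_reference = "AGTGCCGTGATTA"
toy_false_positives = [(2, "T", "G"), (7, "T", "G"), (10, "T", "C")]
toy_counts = context_counts(toy_reference, toy_false_positives)

for (substitution, context), n in toy_counts.most_common():
    print(substitution, context, n)
```

Applied to the false-positive calls of a variant caller, a tally dominated by the `("T>G", "G.G")` key would be consistent with the systematic error reported here. Comparing the tallies between library preparation protocols would then indicate whether the magnitude depends on the protocol.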