Baaijens Jasmijn A, Aabidine Amal Zine El, Rivals Eric, Schönhuth Alexander
Centrum Wiskunde & Informatica, 1098 XG Amsterdam, Netherlands.
LIRMM, CNRS and Université de Montpellier, 34095 Montpellier, France.
Genome Res. 2017 May;27(5):835-848. doi: 10.1101/gr.215038.116. Epub 2017 Apr 10.
A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE uses either FM-index-based data structures or an ad hoc consensus reference sequence to construct overlap graphs from patient sample data. In such an overlap graph, nodes represent reads and/or contigs, while edges indicate that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on an ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE to two deep-coverage samples from patients infected with the Zika virus and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
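The overlap-graph idea described in the abstract can be illustrated with a toy sketch in Python. This is not SAVAGE's implementation: the read positions are assumed to be known already (as if the reads had been placed against an ad hoc consensus), an exact-match test on the overlapping bases stands in for the statistical edge criterion, plain maximal-clique enumeration stands in for the enumeration of well-calibrated read groups, and only a single merging pass is shown rather than the iterative scheme. All function names and the `min_overlap` parameter are hypothetical.

```python
from itertools import combinations


def overlap_ok(a, b, min_overlap=3):
    """a and b are (start, sequence) pairs. Return True if they share at
    least min_overlap positions and agree on every shared position.
    (Simplifying assumption: exact agreement replaces a statistical test.)"""
    (sa, qa), (sb, qb) = a, b
    lo = max(sa, sb)
    hi = min(sa + len(qa), sb + len(qb))
    if hi - lo < min_overlap:
        return False
    return qa[lo - sa:hi - sa] == qb[lo - sb:hi - sb]


def build_overlap_graph(reads, min_overlap=3):
    """Nodes are read indices; an edge joins two reads whose overlap is
    consistent with both coming from the same haplotype."""
    adj = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if overlap_ok(reads[i], reads[j], min_overlap):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def merge_clique(reads, clique):
    """Merge the mutually consistent reads of one clique into a contig.
    Reads in a clique overlap pairwise, so their union has no gaps."""
    start = min(reads[i][0] for i in clique)
    end = max(reads[i][0] + len(reads[i][1]) for i in clique)
    contig = [None] * (end - start)
    for i in clique:
        s, q = reads[i]
        for k, base in enumerate(q):
            contig[s - start + k] = base
    return start, "".join(contig)


# Toy data: two haplotypes that differ at a single position (index 4).
reads = [
    (0, "ACGTAC"),    # from haplotype ACGTACGTAC
    (2, "GTACGTAC"),  # from haplotype ACGTACGTAC
    (0, "ACGTTC"),    # from haplotype ACGTTCGTAC
    (2, "GTTCGTAC"),  # from haplotype ACGTTCGTAC
]
graph = build_overlap_graph(reads)
for clique in maximal_cliques(graph):
    print(clique, merge_clique(reads, clique))
```

In this toy input the single differing base is enough to keep reads from the two haplotypes unconnected, so each clique merges into one full-length haplotype. With real sequencing data, errors make exact matching too strict, which is why a calibrated statistical criterion would be needed in its place.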