Wymant Chris, Blanquart François, Golubchik Tanya, Gall Astrid, Bakker Margreet, Bezemer Daniela, Croucher Nicholas J, Hall Matthew, Hillebregt Mariska, Ong Swee Hoe, Ratmann Oliver, Albert Jan, Bannert Norbert, Fellay Jacques, Fransen Katrien, Gourlay Annabelle, Grabowski M Kate, Gunsenheimer-Bartmeyer Barbara, Günthard Huldrych F, Kivelä Pia, Kouyos Roger, Laeyendecker Oliver, Liitsola Kirsi, Meyer Laurence, Porter Kholoud, Ristola Matti, van Sighem Ard, Berkhout Ben, Cornelissen Marion, Kellam Paul, Reiss Peter, Fraser Christophe
Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
Medical Research Council Centre for Outbreak Analysis and Modelling, Department of Infectious Disease Epidemiology, Imperial College London, London, UK.
Virus Evol. 2018 May 18;4(1):vey007. doi: 10.1093/ve/vey007. eCollection 2018 Jan.
Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered.
We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.
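To make the terms "consensus sequence" and "minority variant" concrete, the following minimal Python sketch calls both from per-position base counts of mapped reads. It is an illustration only and is not shiver's implementation: the function name `call_consensus`, the parameters `min_depth` and `minority_threshold`, and their default values are all hypothetical.

```python
from collections import Counter


def call_consensus(pileup, min_depth=2, minority_threshold=0.05):
    """Call a consensus and minority variants from per-position base counts.

    pileup: list of Counter objects, one per reference position, mapping a
    base to the number of mapped reads supporting it. (Hypothetical input
    format, chosen for illustration.)
    Returns (consensus_string, {position: {base: frequency}}).
    """
    consensus = []
    minority = {}
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            # Too few reads to call a base: report missing sequence.
            consensus.append("N")
            continue
        # Rank bases by read support, most common first.
        ranked = counts.most_common()
        consensus.append(ranked[0][0])
        # Any other base at or above the threshold is a minority variant.
        variants = {
            base: n / depth
            for base, n in ranked[1:]
            if n / depth >= minority_threshold
        }
        if variants:
            minority[pos] = variants
    return "".join(consensus), minority
```

Positions where reads fail to map fall below `min_depth` and are reported as `N`. This is the kind of missing sequence that mapping to a poorly matched reference can produce, and that a reference tailored to the sample is intended to reduce.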