Zhou Shuntai, Jones Corbin, Mieczkowski Piotr, Swanstrom Ronald
UNC Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
J Virol. 2015 Aug;89(16):8540-55. doi: 10.1128/JVI.00522-15. Epub 2015 Jun 3.
Validating the sampling depth and reducing sequencing errors are critical for studies of viral populations using next-generation sequencing (NGS). We previously described the use of Primer ID to tag each viral RNA template with a block of degenerate nucleotides in the cDNA primer. We now show that low-abundance Primer IDs (offspring Primer IDs) are generated due to PCR/sequencing errors. These artifactual Primer IDs can be removed using a cutoff model for the number of reads required to make a template consensus sequence. We have modeled the fraction of sequences lost due to Primer ID resampling. For a typical sequencing run, less than 10% of the raw reads are lost to offspring Primer ID filtering and resampling. The remaining raw reads are used to correct for PCR resampling and sequencing errors. We also demonstrate that Primer ID reveals bias intrinsic to PCR, especially at low template input or utilization. cDNA synthesis and PCR convert ca. 20% of RNA templates into recoverable sequences, and 30-fold sequence coverage recovers most of these template sequences. We have directly measured the residual error rate to be around 1 in 10,000 nucleotides. We use this error rate and the Poisson distribution to define the cutoff to identify preexisting drug resistance mutations at low abundance in an HIV-infected subject. Collectively, these studies show that >90% of the raw sequence reads can be used to validate template sampling depth and to dramatically reduce the error rate in assessing a genetically diverse viral population using NGS.
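As a rough illustration of why Primer ID resampling costs some sequences, the sketch below estimates the fraction of templates expected to share their Primer ID with at least one other template. It assumes every Primer ID is drawn uniformly at random from all 4^L possible sequences of a degenerate block of length L; this is a simple occupancy argument, not necessarily the model used in the study, and the function name and the block length in the usage line are hypothetical.

```python
def resampled_fraction(n_templates: int, id_length: int) -> float:
    """Expected fraction of templates whose Primer ID is also carried by
    at least one other template.

    Assumption (not from the source): Primer IDs are uniform random draws
    from the 4**id_length possible degenerate sequences.
    """
    if n_templates < 1 or id_length < 1:
        raise ValueError("n_templates and id_length must be positive")
    n_ids = 4 ** id_length
    # Probability that none of the other (n - 1) templates drew the same ID.
    p_unique = (1.0 - 1.0 / n_ids) ** (n_templates - 1)
    return 1.0 - p_unique


if __name__ == "__main__":
    # Hypothetical example: 10,000 tagged templates, 9-nucleotide Primer ID.
    print(f"{resampled_fraction(10_000, 9):.4f}")
```

Under this assumption the loss grows with the number of tagged templates and shrinks with a longer degenerate block, which is the trade-off to weigh when choosing how many samples to multiplex.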
Although next-generation sequencing (NGS) has revolutionized sequencing strategies, it suffers from serious limitations in defining sequence heterogeneity in a genetically diverse population, such as HIV-1, due to PCR resampling and PCR/sequencing errors. The Primer ID approach reveals the true sampling depth and greatly reduces errors. Knowing the sampling depth allows the construction of a model of how to maximize the recovery of sequences from input templates and to reduce resampling of the Primer ID so that appropriate multiplexing can be included in the experimental design. With the defined sampling depth and measured error rate, we are able to assign cutoffs for the accurate detection of minority variants in viral populations. This approach allows the power of NGS to be realized without having to guess about sampling depth or to ignore the problem of PCR resampling, while also being able to correct most of the errors in the data set.
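The cutoff idea can be sketched as follows, using the reported residual error rate of about 1 in 10,000 nucleotides. The sketch treats the number of erroneous template consensus sequences at one position as Poisson distributed with mean equal to the number of templates times the error rate, and returns the smallest observation count that residual error alone would be unlikely to produce. The significance level `alpha`, the per-position treatment, and the function names are assumptions made for illustration; the abstract does not give the study's exact procedure.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def min_count_cutoff(n_templates: int,
                     error_rate: float = 1e-4,
                     alpha: float = 1e-3) -> int:
    """Smallest count k such that P(X >= k) < alpha under residual error.

    error_rate: residual errors per nucleotide (about 1e-4 in the source).
    alpha: assumed significance level (hypothetical, not from the source).
    """
    lam = n_templates * error_rate
    k = 1
    while poisson_sf(k, lam) >= alpha:
        k += 1
    return k


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, min_count_cutoff(n))
```

A variant seen in fewer template consensus sequences than the returned count cannot be distinguished from residual error at the chosen `alpha`; a stricter `alpha`, or a correction for testing many positions, would raise the cutoff.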