Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, Germany.
BMC Genomics. 2012 Aug 22;13:417. doi: 10.1186/1471-2164-13-417.
Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions).
We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, SNP-calling quality was substantially affected by the choice of tools and mapping strategy. To reduce computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new, faster approach: target-region mapping followed by 'read-backmapping' to the whole genome to reduce the false detection rate. Based on this comparison, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per HiSeq2000 exome sample and detected ~5% more SNPs than the conventional whole-genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach.
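The read-backmapping step can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Alignment` record, the `backmap_filter` function, the `min_mapq` threshold, and the coordinates are hypothetical, and the rule used here (keep a read only if its whole-genome alignment also falls inside a target region with sufficient mapping quality) is one plausible reading of the approach described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical minimal alignment record (not a SAMtools/BWA type)."""
    read_id: str
    chrom: str
    pos: int   # 1-based leftmost position on the whole-genome reference
    mapq: int  # mapping quality reported by the aligner


def in_targets(aln, targets):
    """True if the alignment start lies inside any (chrom, start, end) target."""
    return any(
        chrom == aln.chrom and start <= aln.pos <= end
        for chrom, start, end in targets
    )


def backmap_filter(target_read_ids, genome_hits, targets, min_mapq=20):
    """Keep reads from target-region mapping that the whole genome confirms.

    target_read_ids: reads that aligned in step 1 (target-only reference).
    genome_hits: step 2 alignments of those same reads to the whole genome.
    min_mapq: assumed cutoff; the source does not state a threshold.
    """
    genome_by_read = {a.read_id: a for a in genome_hits}
    kept = []
    for read_id in target_read_ids:
        hit = genome_by_read.get(read_id)
        if hit is not None and hit.mapq >= min_mapq and in_targets(hit, targets):
            kept.append(read_id)
    return kept


# Toy data with made-up coordinates.
targets = [("chr17", 1000, 2000)]
target_read_ids = ["r1", "r2", "r3", "r4"]
genome_hits = [
    Alignment("r1", "chr17", 1500, 60),  # confirmed inside the target
    Alignment("r2", "chr5", 1500, 60),   # maps elsewhere in the genome
    Alignment("r3", "chr17", 1500, 5),   # ambiguous placement (low MAPQ)
]                                        # r4 has no whole-genome alignment
kept = backmap_filter(target_read_ids, genome_hits, targets)
```

Only the reads that pass step 1 are realigned against the whole genome, which is where the run-time saving would come from; reads such as `r2` that place better elsewhere are discarded before SNP-calling.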
We recommend applying our general 'two-step' mapping approach for more efficient SNP discovery in tNGS. Our study also shows that computing inter-sample SNP concordances and inspecting read alignments yields more confident results.
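As a rough illustration of an inter-sample concordance check, the sketch below scores each pair of samples by the overlap of their SNP call sets. The abstract does not define the concordance measure, so the Jaccard index used here is an assumption, and the function names, sample names, and calls are hypothetical.

```python
from itertools import combinations


def snp_concordance(calls_a, calls_b):
    """Jaccard overlap of two SNP call sets of (chrom, pos, alt) tuples.

    Assumed measure: shared calls divided by all calls seen in either sample.
    """
    union = calls_a | calls_b
    if not union:
        return 1.0  # two empty call sets are treated as fully concordant
    return len(calls_a & calls_b) / len(union)


def pairwise_concordance(samples):
    """Concordance for every unordered pair of samples, keyed by name pair."""
    return {
        (a, b): snp_concordance(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Made-up calls for three multiplexed samples.
samples = {
    "s1": {("chr13", 100, "A"), ("chr13", 200, "T"), ("chr17", 300, "G")},
    "s2": {("chr13", 100, "A"), ("chr13", 200, "T")},
    "s3": {("chr13", 100, "A"), ("chr13", 200, "C")},
}
scores = pairwise_concordance(samples)
```

A pair with unexpectedly low concordance, such as `s1` and `s3` in this toy data, marks the sites where inspecting the read alignments is most worthwhile.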