Statistical Genetics, Max Planck Institute of Psychiatry, Kraepelinstrasse 2-10, 80804 Munich, Germany.
Hum Genet. 2012 Oct;131(10):1541-54. doi: 10.1007/s00439-012-1213-z. Epub 2012 Aug 11.
High-throughput DNA sequencing (HTS) is of increasing importance in the life sciences. One of its most prominent applications is the sequencing of whole genomes or of targeted regions of the genome, such as all exonic regions (i.e., the exome). Here, the objective is to identify genetic variants such as single nucleotide polymorphisms (SNPs). Extracting SNPs from raw sequence data involves many processing steps and a diverse set of tools. We review the essential building blocks of a pipeline that calls SNPs from raw HTS data. The pipeline comprises quality control, mapping of short reads to the reference genome, and visualization and post-processing of the alignment, including base quality recalibration. Its final steps are SNP calling and the filtering of SNP candidates. Each step is accompanied by an analysis of a publicly available whole-exome sequencing dataset. To this end, we employ several alignment programs and SNP calling routines to highlight that the choice of tools significantly affects the final results.
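As a rough illustration of the last pipeline step (filtering SNP candidates), the following minimal sketch keeps only single-nucleotide substitutions that pass a quality and a depth threshold. It assumes the candidates are available as tab-separated VCF-style records with a Phred-scaled QUAL column and a DP (depth) entry in the INFO column. The function names and the thresholds `min_qual=30.0` and `min_depth=10` are hypothetical choices for illustration and are not taken from the article.

```python
from typing import Iterable, Iterator, Tuple

BASES = {"A", "C", "G", "T"}


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_info(info: str) -> dict:
    """Split a VCF INFO string such as 'DP=20;AF=0.5' into a dict of strings."""
    result = {}
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value
    return result


def filter_snp_candidates(
    vcf_lines: Iterable[str],
    min_qual: float = 30.0,  # hypothetical threshold, not from the article
    min_depth: int = 10,     # hypothetical threshold, not from the article
) -> Iterator[Tuple[str, int, str, str, float, int]]:
    """Yield (chrom, pos, ref, alt, qual, depth) for records passing the filters."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
        # Keep only single-base substitutions (drops indels and missing alleles).
        if ref not in BASES or any(a not in BASES for a in alt.split(",")):
            continue
        if qual == ".":
            continue  # missing quality value
        depth = parse_info(info).get("DP")
        if depth is None:
            continue  # no depth annotation available
        if float(qual) >= min_qual and int(depth) >= min_depth:
            yield chrom, int(pos), ref, alt, float(qual), int(depth)
```

With these assumed thresholds, a candidate with QUAL 30 corresponds to an estimated error probability of about 0.001 under the Phred definition. Suitable cut-offs in practice depend on the aligner, the SNP caller, and the coverage of the dataset.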