van Oeveren Jan, Janssen Antoine
Division of Bioinformatics, Keygene N.V., Wageningen, The Netherlands.
Methods Mol Biol. 2009;578:73-91. doi: 10.1007/978-1-60327-411-1_4.
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation and are the basis for most molecular markers. Before SNPs can be used for direct sequence-based SNP detection or in a derived SNP assay, they need to be identified. For regions or species with no validated SNPs in the public databases, a good alternative is to mine them from DNA sequences. Aligning multiple sequence fragments that originate from different genotypes and represent the same region of the genome allows sequence variants to be discovered. The corresponding nucleotide mismatches are likely to be SNPs or insertions/deletions. A large amount of sequence data suitable for mining (both expressed sequence tags and genomic sequences) is present in the public databases and is free to use, so large-scale sequencing of one's own is not required. However, with the appearance of next-generation sequencing machines (Roche GS/454, Illumina GA/Solexa, SOLiD), high-throughput sequencing is becoming widely available. This will allow polymorphic genotypes to be sequenced at specific target areas, followed by SNP identification. In this paper we discuss the bioinformatics tools required to analyze DNA sequence data for SNP mining. A general approach for the consecutive steps in the mining process is described, and commonly used SNP discovery pipelines are presented.
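The core idea of the abstract, that mismatches between aligned fragments from different genotypes are candidate SNPs or insertions/deletions, can be illustrated with a minimal Python sketch. It is not one of the pipelines the paper presents. It assumes the fragments are already aligned to equal length with `-` marking gaps, and the function name `find_variants`, the read labels, and the `min_minor_count` threshold (a crude guard against sequencing errors) are all hypothetical choices made for illustration.

```python
from collections import Counter


def find_variants(aligned, min_minor_count=2):
    """Scan a multiple alignment column by column for candidate variants.

    aligned: dict mapping a read/genotype label to an aligned sequence.
             All sequences must have equal length; '-' marks a gap.
    min_minor_count: assumed threshold; the second most common allele must
             be seen at least this many times to be reported.
    Returns a list of (position, kind, allele_counts) tuples (0-based).
    """
    sequences = list(aligned.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    variants = []
    for pos in range(length):
        # Count alleles in this column, ignoring ambiguous 'N' calls.
        counts = Counter(
            seq[pos].upper() for seq in sequences if seq[pos].upper() != "N"
        )
        if len(counts) < 2:
            continue  # monomorphic column
        minor_count = counts.most_common(2)[1][1]
        if minor_count < min_minor_count:
            continue  # likely a sequencing error rather than a true variant
        kind = "indel" if "-" in counts else "SNP"
        variants.append((pos, kind, dict(counts)))
    return variants


# Hypothetical aligned reads from two genotypes covering the same region.
example = {
    "g1_read1": "ACGTACGT",
    "g1_read2": "ACGTACGT",
    "g2_read1": "ACCTAC-T",
    "g2_read2": "ACCTAC-T",
    "g2_read3": "ACGTACGA",
}

for position, kind, counts in find_variants(example):
    print(position, kind, counts)
```

In this toy input the mismatch at position 2 and the gap at position 6 are each supported by two reads and are reported, while the single `A` at position 7 falls below the assumed threshold and is treated as a probable sequencing error.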