Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao, 266003, China.
Biol Direct. 2012 Jun 8;7:17. doi: 10.1186/1745-6150-7-17.
Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype large numbers of SNPs in non-model organisms with limited or no genomic resources. Most NGS-based genotyping methods, however, require a reference genome to perform accurate SNP calling, and little effort has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome.
Here we describe an improved maximum likelihood (ML) algorithm, called iML, that achieves high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulated and real sequencing datasets, we demonstrate that, in comparison with ML or a threshold approach, iML markedly improves the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat content.
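The abstract does not give the details of iML, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes a biallelic diploid site, a binomial read-sampling model with a fixed per-read error rate for the ML genotype call, and a depth filter that treats a cluster as possibly composite when its read depth is improbably high under a Poisson model (with a normal approximation for large means). All function names, the error rate, the switch point between the Poisson and normal branches, and the significance threshold are hypothetical choices made for illustration.

```python
import math


def genotype_log_likelihoods(n_a, n_b, err=0.01):
    """Binomial log-likelihoods of genotypes AA, AB, BB given allele read counts."""
    # Probability that a single read shows allele A under each genotype
    # (err is an assumed per-read sequencing error rate).
    p_a = {"AA": 1.0 - err, "AB": 0.5, "BB": err}
    log_coeff = (math.lgamma(n_a + n_b + 1)
                 - math.lgamma(n_a + 1) - math.lgamma(n_b + 1))
    return {g: log_coeff + n_a * math.log(p) + n_b * math.log(1.0 - p)
            for g, p in p_a.items()}


def depth_upper_tail(depth, mean_depth, switch=30.0):
    """Approximate P(X >= depth) for X ~ Poisson(mean_depth), mean_depth > 0."""
    if mean_depth > switch:
        # Normal approximation with continuity correction for large means.
        z = (depth - 0.5 - mean_depth) / math.sqrt(mean_depth)
        return 0.5 * math.erfc(z / math.sqrt(2.0))
    # Exact Poisson tail: 1 minus the sum of pmf(k) for k < depth.
    cdf = sum(math.exp(-mean_depth + k * math.log(mean_depth) - math.lgamma(k + 1))
              for k in range(depth))
    return max(0.0, 1.0 - cdf)


def call_genotype(n_a, n_b, mean_depth, alpha=1e-3, err=0.01):
    """Return the ML genotype, or None if the cluster looks composite."""
    if depth_upper_tail(n_a + n_b, mean_depth) < alpha:
        return None  # depth far above expectation: suspected repeat-derived cluster
    ll = genotype_log_likelihoods(n_a, n_b, err)
    return max(ll, key=ll.get)
```

Under these assumptions, a cluster with balanced allele counts at ordinary depth would be called heterozygous, whereas the same allele balance at several times the expected depth would be set aside rather than reported as a SNP, which is the kind of repeat-induced false call the abstract says iML is designed to prevent.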
The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions and thus outperforms the original ML algorithm, achieving much higher genotyping accuracy. Our algorithm is therefore well suited to accurate de novo SNP genotyping in non-model organisms without a reference genome.