Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.
Nucleic Acids Res. 2013 Jul;41(13):e136. doi: 10.1093/nar/gkt372. Epub 2013 May 21.
Both 454 and Ion Torrent sequencers are capable of producing large amounts of long, high-quality sequencing reads. However, because both methods sequence an entire homopolymer in one cycle, they both suffer from homopolymer-length uncertainty and incorporation asynchronization. In mapping, such sequencing errors can shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier to the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) that statistically and explicitly formulates homopolymer sequencing errors in terms of overcalls, undercalls, insertions and deletions. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate the parameters of the HMM from resequencing data with an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity. PyroHMMsnp is hosted at http://code.google.com/p/pyrohmmsnp/.
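To make the final step concrete, the following is a minimal sketch of Bayesian genotype inference at a single site. It is not the PyroHMMsnp implementation: the function names, the uniform substitution error rate, the flat prior over the ten diploid genotypes, and the assumption that reads are independent are all illustrative choices made here. In the approach described above, the per-read likelihoods would instead come from realigning reads under the homopolymer HMM.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def base_likelihood(observed, allele, error_rate):
    """P(observed base | true allele) under a uniform substitution error (assumption)."""
    return 1.0 - error_rate if observed == allele else error_rate / 3.0


def genotype_posteriors(pileup, error_rate=0.01):
    """Posterior over unordered diploid genotypes given the bases observed at one site.

    pileup: string of bases covering the site, one per read.
    error_rate: hypothetical per-base error probability, must lie in (0, 1).
    A flat prior over the 10 genotypes is assumed, so the posterior is the
    normalized likelihood.
    """
    log_post = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        log_lik = 0.0
        for b in pileup:
            # Each read is drawn from either allele with probability 1/2.
            p = 0.5 * base_likelihood(b, a1, error_rate) \
                + 0.5 * base_likelihood(b, a2, error_rate)
            log_lik += math.log(p)
        log_post[(a1, a2)] = log_lik
    # Normalize in log space (log-sum-exp) to avoid underflow at high coverage.
    peak = max(log_post.values())
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / total for g, v in log_post.items()}


def call_genotype(pileup, error_rate=0.01):
    """Return the maximum a posteriori genotype as a tuple such as ('A', 'G')."""
    post = genotype_posteriors(pileup, error_rate)
    return max(post, key=post.get)


if __name__ == "__main__":
    print(call_genotype("AAAAAGGGGG"))  # ('A', 'G')
    print(call_genotype("AAAAAAAAAG"))  # ('A', 'A')
```

With ten reads split evenly between A and G, the heterozygous genotype dominates; with a single discordant G, the homozygous A genotype is preferred because one error is far more likely than nine reads all drawn from one allele of a heterozygote.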