Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Bioinformatics. 2013 Oct 1;29(19):2427-34. doi: 10.1093/bioinformatics/btt418. Epub 2013 Aug 13.
Hidden Markov models based on the Li and Stephens model, which accounts for chromosome sharing among multiple individuals, underlie the mainstream haplotype phasing algorithms for genotyping array and next-generation sequencing (NGS) data. However, existing methods based on this model assume that allele count data are observed independently at individual sites and do not consider haplotype informative reads, i.e. reads that cover multiple heterozygous sites and therefore carry useful haplotype information. In our previous work, we developed a hidden Markov model that incorporates a two-site joint emission term to capture haplotype information across two adjacent sites. Although that model improves the accuracy of genotype calling and haplotype phasing, it does not use the haplotype information in reads covering non-adjacent sites and/or more than two adjacent sites because of the severe computational burden.
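To make the notion of a haplotype informative read concrete, the following minimal Python sketch selects reads that cover at least two heterozygous sites in one individual. It is an illustration only, not code from HapSeq or HapSeq2; the function name `haplotype_informative_reads`, the input layout (read ID mapped to covered site positions) and the set of heterozygous positions are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set


def haplotype_informative_reads(
    read_sites: Dict[str, Iterable[int]],
    het_sites: Set[int],
) -> Dict[str, List[int]]:
    """Return reads covering two or more heterozygous sites.

    read_sites: hypothetical mapping of read ID -> variant positions the read covers.
    het_sites:  hypothetical set of positions at which the individual is heterozygous.
    """
    informative: Dict[str, List[int]] = {}
    for read_id, sites in read_sites.items():
        # Keep only the covered positions that are heterozygous in this individual.
        covered_hets = sorted(set(sites) & het_sites)
        # A read links alleles across sites only if it spans at least two of them.
        if len(covered_hets) >= 2:
            informative[read_id] = covered_hets
    return informative


# Toy input: positions are arbitrary and the two sites on a read need not be adjacent.
example_reads = {"r1": [100, 150, 220], "r2": [100], "r3": [150, 400]}
example_hets = {100, 220, 400}
selected = haplotype_informative_reads(example_reads, example_hets)
```

In this toy input only `r1` is retained, because it spans the heterozygous sites 100 and 220; `r2` and `r3` each cover a single heterozygous site and so carry no phase information on their own.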
We develop a new probabilistic model for genotype calling and haplotype phasing from NGS data that incorporates the haplotype information of multiple adjacent and/or non-adjacent sites covered by a read over an arbitrary distance. We also develop a new hybrid Markov chain Monte Carlo algorithm that combines the Gibbs sampling algorithm of HapSeq with the Metropolis-Hastings algorithm and is computationally feasible. Using simulations and real data from the 1000 Genomes Project, we show that our model offers superior performance over existing methods for haplotype phasing and genotype calling from population NGS data.
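The abstract does not specify the proposal distribution or the target density used in the hybrid sampler. As a generic illustration of the Metropolis-Hastings component only, the sketch below shows the standard log-space acceptance test that such a sampler could apply to a proposed update. The function name `mh_accept` and all of its arguments are assumptions for this example, not part of HapSeq2.

```python
import math
import random


def mh_accept(
    log_target_current: float,
    log_target_proposed: float,
    log_q_forward: float,
    log_q_reverse: float,
    rng: random.Random,
) -> bool:
    """Standard Metropolis-Hastings acceptance test in log space.

    log_q_forward: log proposal density of moving current -> proposed.
    log_q_reverse: log proposal density of moving proposed -> current.
    All inputs are hypothetical placeholders for a model-specific target and proposal.
    """
    # Log of the Hastings ratio: pi(x') q(x | x') / (pi(x) q(x' | x)).
    log_ratio = (log_target_proposed + log_q_reverse) - (
        log_target_current + log_q_forward
    )
    if log_ratio >= 0.0:
        # Moves that do not lower the ratio are always accepted.
        return True
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    return math.log(1.0 - rng.random()) < log_ratio


rng = random.Random(0)
uphill = mh_accept(-10.0, -5.0, 0.0, 0.0, rng)   # proposed state is more probable
downhill = mh_accept(0.0, -1e9, 0.0, 0.0, rng)   # proposed state is vastly less probable
```

In a hybrid scheme of the kind the abstract describes, a test like this would decide whether to keep a proposed state, while other quantities are redrawn by Gibbs steps from their conditional distributions; how HapSeq2 divides the work between the two is not stated here.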
HapSeq2 is available at www.ssg.uab.edu/hapseq/.