Department of Statistics, The Pennsylvania State University, 325 Thomas, University Park, PA 16802, USA.
Bioinformatics. 2013 Apr 1;29(7):878-85. doi: 10.1093/bioinformatics/btt065. Epub 2013 Feb 13.
Next-generation sequencing (NGS) technologies have enabled whole-genome discovery and analysis of genetic variants in many species of interest. Individuals are often sequenced at low coverage for detecting novel variants, phasing haplotypes and inferring population structures. Although several tools have been developed for SNP and genotype calling in NGS data, haplotype phasing is often done separately on the called genotypes.
We propose a dynamic Bayesian Markov model (DBM) for simultaneous genotype calling and haplotype phasing in low-coverage NGS data of unrelated individuals. Our method is fully probabilistic and produces consistent inference of genotypes, haplotypes and recombination probabilities. Using data from the 1000 Genomes Project, we demonstrate that DBM not only yields more accurate results than some popular methods, but also provides a novel characterization of haplotype structures at the individual level for visualization, interpretation and comparison in downstream analysis. DBM is a powerful and flexible tool that can be applied to many sequencing studies. Its statistical framework can also be extended to accommodate a broader range of data.
http://stat.psu.edu/~yuzhang/software/dbm.tar
Supplementary data are available at Bioinformatics online.