Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Bioinformatics. 2012 Apr 1;28(7):938-46. doi: 10.1093/bioinformatics/bts047. Epub 2012 Jan 27.
Low-coverage sequencing provides an economical strategy for whole-genome sequencing. When a set of individuals is sequenced, genotype calling can be challenging because of the low sequencing coverage. Linkage disequilibrium (LD)-based refinement of genotype calls is essential to improve accuracy. Current LD-based methods use read counts or genotype likelihoods at individual potential polymorphic sites (PPSs). Reads that span multiple PPSs (jumping reads) can provide additional haplotype information that current methods overlook.
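To make the notion of a jumping read concrete, the following minimal sketch tallies the allele pairs that single reads observe at adjacent PPSs. It is an illustration only, not the HapSeq implementation: the function name `tally_jumping_reads`, the representation of a read as a mapping from position to observed allele, and the toy positions and alleles are all assumptions made for this example.

```python
from collections import Counter


def tally_jumping_reads(reads, pps_positions):
    """Count allele pairs seen by single reads at adjacent PPSs.

    reads: list of dicts mapping genomic position -> observed allele
           (a hypothetical, simplified read representation).
    pps_positions: iterable of PPS positions.
    Returns {(left_pos, right_pos): Counter({(left_allele, right_allele): n})}.
    """
    pps_order = sorted(pps_positions)
    index = {pos: i for i, pos in enumerate(pps_order)}
    pair_counts = {}
    for read in reads:
        # Keep only the positions of this read that are PPSs.
        covered = sorted(p for p in read if p in index)
        # A read covering fewer than two PPSs contributes nothing here.
        for left, right in zip(covered, covered[1:]):
            # Only PPSs that are neighbours in the PPS ordering are tallied.
            if index[right] - index[left] == 1:
                counter = pair_counts.setdefault((left, right), Counter())
                counter[(read[left], read[right])] += 1
    return pair_counts


# Toy data: three PPSs and five reads, four of which are jumping reads.
reads = [
    {100: "A", 130: "G"},
    {100: "A", 130: "G"},
    {100: "C", 130: "T"},
    {130: "G"},            # covers a single PPS, so it is not a jumping read
    {130: "T", 170: "C"},
]
counts = tally_jumping_reads(reads, [100, 130, 170])
for pair, counter in sorted(counts.items()):
    print(pair, dict(counter))
```

Per-site read counts alone would record only how often each allele was seen at each position; the pair tallies additionally record which alleles were observed together on the same read, which is the haplotype information the abstract refers to.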
In this article, we introduce a new Hidden Markov Model (HMM)-based method that takes into account jumping-read information across adjacent PPSs, and we implement it in the HapSeq program. Our method extends the HMM in Thunder and explicitly models jumping-read information as emission probabilities conditional on the states of adjacent PPSs. Our simulation results show that, compared to Thunder, HapSeq reduces the genotyping error rate by 30%, from 0.86% to 0.60%. The results from the 1000 Genomes Project show that HapSeq reduces the genotyping error rate by 12% and 9%, from 2.24% to 1.97% and from 2.76% to 2.50%, for individuals with European and African ancestry, respectively. We expect HapSeq to improve the genotyping quality of the large number of ongoing and planned whole-genome sequencing projects.
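The percentage reductions quoted above are relative reductions in the error rate, not differences in percentage points. The short sketch below recomputes them from the reported before and after rates; the helper name `relative_reduction` and the dictionary labels are introduced here purely for illustration.

```python
def relative_reduction(before, after):
    """Relative reduction of an error rate, in percent of the starting rate."""
    return (before - after) / before * 100.0


# Error rates (in percent) as reported in the abstract: (Thunder, HapSeq).
rates = {
    "simulation": (0.86, 0.60),
    "European ancestry": (2.24, 1.97),
    "African ancestry": (2.76, 2.50),
}

reductions = {
    label: relative_reduction(before, after)
    for label, (before, after) in rates.items()
}
for label, value in reductions.items():
    print(f"{label}: {value:.1f}% relative reduction")
```

Rounded to whole percentages, the three values agree with the 30%, 12% and 9% stated in the text.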
dzhi@ms.soph.uab.edu; kzhang@ms.soph.uab.edu
The HapSeq software package and its manual are available for download at www.ssg.uab.edu/hapseq/.
Supplementary data are available at Bioinformatics online.