Genotype calling from next-generation sequencing data using haplotype information of reads.

Affiliation

Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.

Publication information

Bioinformatics. 2012 Apr 1;28(7):938-46. doi: 10.1093/bioinformatics/bts047. Epub 2012 Jan 27.

DOI:10.1093/bioinformatics/bts047

PMID:22285565

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3493122/

Abstract

MOTIVATION

Low-coverage sequencing provides an economical strategy for whole-genome sequencing. When a set of individuals is sequenced, genotype calling can be challenging because of the low coverage. Linkage disequilibrium (LD)-based refinement of genotype calls is essential for improving accuracy. Current LD-based methods use read counts or genotype likelihoods at individual potential polymorphic sites (PPSs). Reads that span multiple PPSs (jumping reads) can provide additional haplotype information that current methods overlook.
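To illustrate why a jumping read carries information that per-site likelihoods miss, the following minimal sketch compares the two for an individual who is heterozygous at two adjacent PPSs. It is not taken from HapSeq or Thunder: the function names, the 0/1 allele coding, the symmetric per-base error rate `err`, and the assumption that a read is drawn from either chromosome with probability 1/2 are all illustrative assumptions.

```python
def read_given_haplotype(read, hap, err):
    """P(read alleles | read sampled from hap), assuming independent per-base errors."""
    p = 1.0
    for r, h in zip(read, hap):
        p *= (1.0 - err) if r == h else err
    return p


def jumping_read_likelihood(read, hap_pair, err=0.01):
    """Likelihood of one read spanning both sites, keeping the alleles linked."""
    return 0.5 * sum(read_given_haplotype(read, h, err) for h in hap_pair)


def per_site_likelihood(read, hap_pair, err=0.01):
    """Likelihood when each site is scored on its own, ignoring the linkage."""
    p = 1.0
    for i, r in enumerate(read):
        p *= 0.5 * sum((1.0 - err) if h[i] == r else err for h in hap_pair)
    return p


# Two phasings of the same double-heterozygous genotype (hypothetical example).
cis = ((0, 0), (1, 1))
trans = ((0, 1), (1, 0))
read = (1, 1)  # one read showing the alternate allele at both sites

print(per_site_likelihood(read, cis), per_site_likelihood(read, trans))
print(jumping_read_likelihood(read, cis), jumping_read_likelihood(read, trans))
```

Under these assumptions the per-site likelihood is the same for both phasings, whereas the jumping-read likelihood strongly favors the phasing in which the two alternate alleles sit on the same chromosome.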

RESULTS

In this article, we introduce a new hidden Markov model (HMM)-based method that takes into account jumping-read information across adjacent PPSs, and we implement it in the HapSeq program. Our method extends the HMM in Thunder and explicitly models jumping-read information as emission probabilities conditional on the states of adjacent PPSs. Our simulation results show that, compared with Thunder, HapSeq reduces the genotyping error rate by 30%, from 0.86% to 0.60%. Results from the 1000 Genomes Project show that HapSeq reduces the genotyping error rate by 12% and 9%, from 2.24% and 2.76% to 1.97% and 2.50%, for individuals of European and African ancestry, respectively. We expect that our program can improve genotyping quality in the large number of ongoing and planned whole-genome sequencing projects.
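The idea of an emission probability that depends on the states at two adjacent PPSs can be sketched with a small forward algorithm. This is a simplified illustration, not the HapSeq or Thunder implementation: it assumes a copying-style HMM in which each of an individual's two chromosomes is a mosaic of reference haplotypes, a constant switch probability `theta`, a symmetric per-base error rate `err`, and 0/1 allele coding. All names and parameters are hypothetical.

```python
from itertools import product


def match(obs, allele, err):
    """P(observed base | true allele) with a symmetric per-base error rate."""
    return 1.0 - err if obs == allele else err


def forward_likelihood(ref_haps, single_reads, jumping_reads, theta=0.05, err=0.01):
    """Likelihood of one individual's reads under a diploid copying HMM.

    ref_haps:      list of K reference haplotypes, each a list of 0/1 alleles.
    single_reads:  dict site j -> observed alleles from reads covering only site j.
    jumping_reads: dict site j -> (allele at j, allele at j+1) pairs from reads
                   covering both site j and site j+1.
    """
    K, M = len(ref_haps), len(ref_haps[0])
    states = list(product(range(K), repeat=2))  # ordered pairs of reference haplotypes

    def site_emission(j, state):
        # A single-site read comes from either chromosome with probability 1/2.
        p = 1.0
        for obs in single_reads.get(j, []):
            p *= 0.5 * sum(match(obs, ref_haps[s][j], err) for s in state)
        return p

    def pair_emission(j, prev, cur):
        # A jumping read stays on one chromosome, so its probability depends on
        # the states at site j (prev) and site j+1 (cur) jointly.
        p = 1.0
        for a, b in jumping_reads.get(j, []):
            p *= 0.5 * sum(
                match(a, ref_haps[s][j], err) * match(b, ref_haps[t][j + 1], err)
                for s, t in zip(prev, cur)
            )
        return p

    def transition(prev, cur):
        # The two chromosomes switch reference haplotype independently.
        p = 1.0
        for s, t in zip(prev, cur):
            p *= (1.0 - theta) * (s == t) + theta / K
        return p

    alpha = {st: site_emission(0, st) / len(states) for st in states}
    for j in range(M - 1):
        alpha = {
            cur: site_emission(j + 1, cur)
            * sum(alpha[prev] * transition(prev, cur) * pair_emission(j, prev, cur)
                  for prev in states)
            for cur in states
        }
    return sum(alpha.values())


# Hypothetical two-site panel in which the alleles are in strong LD.
panel = [[0, 0], [1, 1]]
concordant = forward_likelihood(panel, {}, {0: [(1, 1)]})
discordant = forward_likelihood(panel, {}, {0: [(1, 0)]})
print(concordant, discordant)
```

In this toy panel, a jumping read that agrees with a reference haplotype at both sites receives a higher likelihood than one that would require a haplotype switch or a sequencing error. A per-site model cannot make that distinction from the same read.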

CONTACT

dzhi@ms.soph.uab.edu; kzhang@ms.soph.uab.edu

AVAILABILITY

The software package HapSeq and its manual can be found and downloaded at www.ssg.uab.edu/hapseq/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Joint haplotype phasing and genotype calling of multiple individuals using haplotype informative reads.

Bioinformatics. 2013 Oct 1;29(19):2427-34. doi: 10.1093/bioinformatics/btt418. Epub 2013 Aug 13.

Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data.

Bioinformatics. 2013 Sep 15;29(18):2245-52. doi: 10.1093/bioinformatics/btt386. Epub 2013 Jul 3.

Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads.

BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S53. doi: 10.1186/1471-2105-12-S1-S53.

Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold.

Bioinformatics. 2013 Jan 1;29(1):84-91. doi: 10.1093/bioinformatics/bts632. Epub 2012 Oct 23.

A computational method for genotype calling in family-based sequencing data.

BMC Bioinformatics. 2016 Jan 16;17:37. doi: 10.1186/s12859-016-0880-5.

Reveel: large-scale population genotyping using low-coverage sequencing data.

Bioinformatics. 2016 Jun 1;32(11):1686-96. doi: 10.1093/bioinformatics/btv530. Epub 2015 Sep 9.

A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing.

Bioinformatics. 2013 Apr 1;29(7):878-85. doi: 10.1093/bioinformatics/btt065. Epub 2013 Feb 13.

A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium.

BMC Bioinformatics. 2015 Feb 22;16:61. doi: 10.1186/s12859-015-0479-2.

On the design and analysis of next-generation sequencing genotyping for a cohort with haplotype-informative reads.

Methods. 2015 Jun;79-80:41-6. doi: 10.1016/j.ymeth.2015.01.016. Epub 2015 Jan 30.

Cited by

Pairwise comparative analysis of six haplotype assembly methods based on users' experience.

BMC Genom Data. 2023 Jun 29;24(1):35. doi: 10.1186/s12863-023-01134-5.

Likelihood-based complex trait association testing for arbitrary depth sequencing data.

Bioinformatics. 2015 Sep 15;31(18):2955-62. doi: 10.1093/bioinformatics/btv307. Epub 2015 May 14.

On the design and analysis of next-generation sequencing genotyping for a cohort with haplotype-informative reads.

Methods. 2015 Jun;79-80:41-6. doi: 10.1016/j.ymeth.2015.01.016. Epub 2015 Jan 30.

Detection of Mendelian consistent genotyping errors in pedigrees.

Genet Epidemiol. 2014 May;38(4):291-9. doi: 10.1002/gepi.21806. Epub 2014 Apr 9.

HapFABIA: identification of very short segments of identity by descent characterized by rare variants in large sequencing data.

Nucleic Acids Res. 2013 Dec;41(22):e202. doi: 10.1093/nar/gkt1013. Epub 2013 Oct 29.

Quantifying population genetic differentiation from next-generation sequencing data.

Genetics. 2013 Nov;195(3):979-92. doi: 10.1534/genetics.113.154740. Epub 2013 Aug 26.

Joint haplotype phasing and genotype calling of multiple individuals using haplotype informative reads.

Bioinformatics. 2013 Oct 1;29(19):2427-34. doi: 10.1093/bioinformatics/btt418. Epub 2013 Aug 13.

MixSIH: a mixture model for single individual haplotyping.

BMC Genomics. 2013;14 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-14-S2-S5. Epub 2013 Feb 15.

A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing.

Bioinformatics. 2013 Apr 1;29(7):878-85. doi: 10.1093/bioinformatics/btt065. Epub 2013 Feb 13.

