An MCMC algorithm for haplotype assembly from whole-genome sequence data.

Author information

Bansal Vikas, Halpern Aaron L, Axelrod Nelson, Bafna Vineet

Affiliations

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA.

Publication information

Genome Res. 2008 Aug;18(8):1336-46. doi: 10.1101/gr.077065.108.

DOI:10.1101/gr.077065.108

PMID:18676820

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2493424/

Abstract

In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from (http://www.cse.ucsd.edu/users/vibansal/HASH/).
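The abstract reports two summary statistics, an N50 haplotype length and a switch error rate. The following is a minimal sketch of how such quantities are commonly computed, not the HASH implementation. It assumes the usual definitions: N50 is the largest length L such that blocks of length at least L cover half of the total, and a switch error is a change in relative phase between adjacent heterozygous sites. The function names and the 0/1 string encoding of a haplotype are illustrative assumptions.

```python
from typing import Sequence


def n50(block_lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of haplotype block lengths.

    Blocks are visited from longest to shortest; the result is the length
    of the block at which the running sum first reaches half of the total.
    """
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def switch_error_rate(inferred: str, truth: str) -> float:
    """Return the fraction of adjacent site pairs whose relative phase differs.

    Both arguments encode one haplotype of the same block as a string of
    '0'/'1' alleles at heterozygous sites (an assumed encoding). Because only
    changes in agreement are counted, an inferred haplotype that is the exact
    complement of the truth has zero switch errors.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        return 0.0
    # agree[i] is True when the inferred allele matches the truth at site i.
    agree = [a == b for a, b in zip(inferred, truth)]
    # A switch occurs wherever agreement flips between neighbouring sites.
    switches = sum(1 for x, y in zip(agree, agree[1:]) if x != y)
    return switches / (len(agree) - 1)
```

For example, `n50([100, 200, 300, 400])` is 300, and `switch_error_rate("0011", "0000")` is one switch over three adjacent pairs.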

Similar articles

An MCMC algorithm for haplotype assembly from whole-genome sequence data.

Genome Res. 2008 Aug;18(8):1336-46. doi: 10.1101/gr.077065.108.

Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm.

BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8.

HapCUT: an efficient and accurate algorithm for the haplotype assembly problem.

Bioinformatics. 2008 Aug 15;24(16):i153-9. doi: 10.1093/bioinformatics/btn298.

Optimal algorithms for haplotype assembly from whole-genome sequence data.

Bioinformatics. 2010 Jun 15;26(12):i183-90. doi: 10.1093/bioinformatics/btq215.

An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments.

Bioinformatics. 2018 Jan 1;34(1):155-162. doi: 10.1093/bioinformatics/btx436.

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies.

Genome Res. 2017 May;27(5):801-812. doi: 10.1101/gr.213462.116. Epub 2016 Dec 9.

Joint haplotype phasing and genotype calling of multiple individuals using haplotype informative reads.

Bioinformatics. 2013 Oct 1;29(19):2427-34. doi: 10.1093/bioinformatics/btt418. Epub 2013 Aug 13.

Association of genetic traits to estimated haplotypes from SNP genotypes using EM algorithm and Markov Chain Monte Carlo techniques.

Hum Hered. 2003;56(1-3):41-7. doi: 10.1159/000073731.

HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data.

J Comput Biol. 2012 Jun;19(6):577-90. doi: 10.1089/cmb.2012.0084.

Hap-seq: an optimal algorithm for haplotype phasing with imputation using sequencing data.

J Comput Biol. 2013 Feb;20(2):80-92. doi: 10.1089/cmb.2012.0091.

Cited by

Inferring compound heterozygosity from large-scale exome sequencing data.

Nat Genet. 2024 Jan;56(1):152-161. doi: 10.1038/s41588-023-01608-3. Epub 2023 Dec 6.

Pairwise comparative analysis of six haplotype assembly methods based on users' experience.

BMC Genom Data. 2023 Jun 29;24(1):35. doi: 10.1186/s12863-023-01134-5.

Inferring compound heterozygosity from large-scale exome sequencing data.

bioRxiv. 2023 Aug 21:2023.03.19.533370. doi: 10.1101/2023.03.19.533370.

Haplotyping-Assisted Diploid Assembly and Variant Detection with Linked Reads.

Methods Mol Biol. 2023;2590:161-182. doi: 10.1007/978-1-0716-2819-5_11.

HapCUT2: A Method for Phasing Genomes Using Experimental Sequence Data.

Methods Mol Biol. 2023;2590:139-147. doi: 10.1007/978-1-0716-2819-5_9.

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C.

Nat Commun. 2021 Apr 28;12(1):1935. doi: 10.1038/s41467-020-20536-y.

Linked-Read Whole Genome Sequencing Solves a Double Gene Rearrangement.

Genes (Basel). 2021 Jan 21;12(2):133. doi: 10.3390/genes12020133.

A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model.

PLoS One. 2020 Oct 29;15(10):e0241291. doi: 10.1371/journal.pone.0241291. eCollection 2020.

NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering.

BMC Bioinformatics. 2020 Oct 22;21(1):475. doi: 10.1186/s12859-020-03775-0.

Gene-Level, but Not Chromosome-Wide, Divergence between a Very Young House Fly Proto-Y Chromosome and Its Homologous Proto-X Chromosome.

Mol Biol Evol. 2021 Jan 23;38(2):606-618. doi: 10.1093/molbev/msaa250.
