HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data.

Authors

Aguiar Derek, Istrail Sorin

Affiliation

Department of Computer Science, Brown University, Providence RI 02912, USA.

Publication information

J Comput Biol. 2012 Jun;19(6):577-90. doi: 10.1089/cmb.2012.0084.

DOI:10.1089/cmb.2012.0084

PMID:22697235

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3375639/

Abstract

Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data--known as haplotype assembly--have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).
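The graph model described in the abstract can be illustrated with a small sketch. This is not the HapCompass implementation: the read representation (a dict mapping SNP index to observed allele 0 or 1), the function names `build_graph` and `phase`, and the +1/-1 edge weighting are all assumptions made for illustration. In place of the paper's cycle basis local optimization, the sketch uses a simpler greedy rule: it builds a spanning forest from the edges with the strongest evidence and discards any edge that would close a cycle.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(reads):
    """Build a SNP graph from reads (hypothetical representation).

    Each read is a dict {snp_index: allele}, allele in {0, 1}.
    Edge weight > 0: reads favour equal alleles on one haplotype (00/11).
    Edge weight < 0: reads favour opposite alleles (01/10).
    """
    weights = defaultdict(int)
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            weights[(i, j)] += 1 if read[i] == read[j] else -1
    return dict(weights)


def phase(reads):
    """Greedy spanning-forest phasing (a stand-in, not the cycle basis method).

    Returns {snp: allele on haplotype 1}; the second haplotype is the
    complement. The lowest-numbered SNP of each component is set to 0.
    """
    weights = build_graph(reads)
    snps = sorted({s for read in reads for s in read})
    parent = {s: s for s in snps}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = defaultdict(list)
    # Consider the strongest evidence first; skip edges with no net evidence.
    for (i, j), w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
        if w == 0:
            continue
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so keep it in the forest
            parent[ri] = rj
            flip = 0 if w > 0 else 1
            tree[i].append((j, flip))
            tree[j].append((i, flip))

    # Propagate alleles along the forest edges.
    hap = {}
    for root in snps:
        if root in hap:
            continue
        hap[root] = 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, flip in tree[u]:
                if v not in hap:
                    hap[v] = hap[u] ^ flip
                    stack.append(v)
    return hap


reads = [
    {0: 0, 1: 0}, {0: 1, 1: 1},  # SNPs 0 and 1 agree
    {1: 0, 2: 1}, {1: 1, 2: 0},  # SNPs 1 and 2 disagree
    {0: 0, 2: 0},                # one conflicting read between SNPs 0 and 2
]
print(phase(reads))
```

In this toy input the single read linking SNPs 0 and 2 contradicts the phasing implied by the other four reads. It closes a cycle in the graph, and the greedy rule drops it because its weight is the smallest. Resolving such conflicting cycles is the problem that the paper's minimum weighted edge removal optimization addresses.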
