Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads.

Authors

Aury Jean-Marc, Istace Benjamin

Affiliation

Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France.

Publication information

NAR Genom Bioinform. 2021 May 3;3(2):lqab034. doi: 10.1093/nargab/lqab034. eCollection 2021 Jun.

DOI:10.1093/nargab/lqab034

PMID:33987534

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8092372/

Abstract

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This directly affects the base quality of genome assemblies, particularly in coding regions, where sequencing errors can disrupt the coding frame of genes. In diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes, which can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; they generally inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms cannot properly process diploid genomes and typically switch from one haplotype to another. Here we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
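
The haplotype-switching problem described in the abstract can be illustrated with a toy sketch. This is not Hapo-G's implementation; the read layout, the `GAP` marker, and both function names are assumptions made for illustration only. It contrasts a column-by-column majority vote with a vote restricted to reads that are linked to the alleles already chosen.

```python
from collections import Counter

GAP = "."  # hypothetical marker: the read does not cover this position

def per_position_consensus(reads):
    """Pick the most common base in each column independently."""
    out = []
    for pos in range(len(reads[0])):
        bases = [r[pos] for r in reads if r[pos] != GAP]
        out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)

def phased_consensus(reads):
    """Toy phase-aware vote: once an allele is chosen at a variable site,
    only reads that cover a chosen site and agree with every chosen
    allele they cover may vote at later positions."""
    chosen = {}  # variable position -> allele selected so far
    out = []
    for pos in range(len(reads[0])):
        covering = [r for r in reads if r[pos] != GAP]
        linked = [
            r for r in covering
            if any(r[p] != GAP for p in chosen)
            and all(r[p] in (GAP, allele) for p, allele in chosen.items())
        ]
        # Fall back to all covering reads when no linked read is available.
        voters = linked if linked else covering
        base = Counter(r[pos] for r in voters).most_common(1)[0][0]
        if len({r[pos] for r in covering}) > 1:
            chosen[pos] = base  # remember the allele picked at this site
        out.append(base)
    return "".join(out)

# Two haplotypes that differ at positions 1 and 6 (0-based).
hap1 = "ACGTACGT"
hap2 = "ATGTACAT"

# Aligned reads with uneven coverage of the two haplotypes.
reads = [
    "ACGTACGT",  # haplotype 1, full length
    "ACGT....",  # haplotype 1, left part only
    "ATGTACAT",  # haplotype 2, full length
    "....ACAT",  # haplotype 2, right part only
    "...TACAT",  # haplotype 2, right part only
]

naive = per_position_consensus(reads)
phased = phased_consensus(reads)
print(naive, phased)
```

With these reads, the independent vote takes the haplotype 1 allele at the first variable site and the haplotype 2 allele at the second, so the result matches neither haplotype. The linked vote keeps the second site consistent with the first. A real polisher also has to deal with sequencing errors, indels, and alignment ambiguity, none of which this sketch handles.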

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e956/8092372/070a67763da9/lqab034fig1.jpg

Similar articles

Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses.

Genomics. 2021 May;113(3):1366-1377. doi: 10.1016/j.ygeno.2021.03.018. Epub 2021 Mar 11.

Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing.

BMC Genomics. 2019 Jan 9;20(1):23. doi: 10.1186/s12864-018-5381-7.

Genome assembly using Nanopore-guided long and error-free DNA reads.

BMC Genomics. 2015 Apr 20;16(1):327. doi: 10.1186/s12864-015-1519-z.

Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction.

Chin Med. 2022 Aug 9;17(1):94. doi: 10.1186/s13020-022-00644-1.

Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case.

BMC Genomics. 2018 Dec 29;19(1):977. doi: 10.1186/s12864-018-5348-8.

Evaluation of assembly methods combining long-reads and short-reads to obtain sp. R4 high-quality complete genome.

3 Biotech. 2020 Nov;10(11):480. doi: 10.1007/s13205-020-02474-0. Epub 2020 Oct 19.

Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps.

Nat Plants. 2018 Nov;4(11):879-887. doi: 10.1038/s41477-018-0289-4. Epub 2018 Nov 2.

Benchmarking multi-platform sequencing technologies for human genome assembly.

Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad300.

de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.

Gigascience. 2017 Feb 1;6(2):1-13. doi: 10.1093/gigascience/giw018.

Cited by

Chromosome-level genome assembly of the Vermilion Snapper (Rhomboplites aurorubens).

Sci Data. 2025 Jul 23;12(1):1281. doi: 10.1038/s41597-025-05573-w.

Interplay between large low-recombining regions and pseudo-overdominance in a plant genome.

Nat Commun. 2025 Jul 12;16(1):6458. doi: 10.1038/s41467-025-61529-z.

Harboring Starships: The Accumulation of Large Horizontal Gene Transfers in Domesticated and Pathogenic Fungi.

Genome Biol Evol. 2025 Jul 3;17(7). doi: 10.1093/gbe/evaf125.

Chromosome-scale assemblies of three Ormosia species: repetitive sequences distribution and structural rearrangement.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf047.

A conserved terpene cyclase gene in Sanghuangporus for abscisic acid-related sesquiterpenoid biosynthesis.

BMC Genomics. 2025 Apr 15;26(1):378. doi: 10.1186/s12864-025-11542-9.

Duckweed genomes and epigenomes underlie triploid hybridization and clonal reproduction.

Curr Biol. 2025 Apr 21;35(8):1828-1847.e9. doi: 10.1016/j.cub.2025.03.013. Epub 2025 Apr 1.

Chromosome-level genome assembly of a doubled haploid brook trout (Salvelinus fontinalis).

G3 (Bethesda). 2025 Jun 4;15(6). doi: 10.1093/g3journal/jkaf066.

Insights into the genomic and phenotypic diversity of Monosporozyma unispora strains isolated from anthropic environments.

FEMS Yeast Res. 2025 Jan 30;25. doi: 10.1093/femsyr/foaf016.

Draft genome sequences of five glacial fungi from Styx Glacier, Antarctica.

Microbiol Resour Announc. 2025 Apr 10;14(4):e0117524. doi: 10.1128/mra.01175-24. Epub 2025 Feb 27.

On the way to diploidization and unexpected ploidy in the grass Sporobolus section Spartina mesopolyploids.

Nat Commun. 2025 Feb 26;16(1):1997. doi: 10.1038/s41467-025-56983-8.
