Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.

Author information

Zimin Aleksey V, Puiu Daniela, Luo Ming-Cheng, Zhu Tingting, Koren Sergey, Marçais Guillaume, Yorke James A, Dvořák Jan, Salzberg Steven L

Affiliations

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA.

Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA.

Publication information

Genome Res. 2017 May;27(5):787-792. doi: 10.1101/gr.213405.116. Epub 2017 Jan 27.

DOI:10.1101/gr.213405.116

PMID:28130360

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5411773/

Abstract

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
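
The abstract reports contiguity as an N50 contig size. As a minimal sketch of how that statistic is commonly computed (this is an illustration, not the authors' code; the function name `n50` and the toy contig lengths are hypothetical), N50 is the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative sum first reaches half of the total.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example with made-up contig lengths (not data from the paper).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Under this definition, an N50 of 486,807 nucleotides would mean that contigs of at least that length together account for at least half of the assembled sequence.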

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62b6/5411773/20de78cdfbf9/787f01.jpg
