Efficient detection and assembly of non-reference DNA sequences with synthetic long reads.

Affiliations

Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, NY 10021, USA.

Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY 10021, USA.

Publication information

Nucleic Acids Res. 2022 Oct 14;50(18):e108. doi: 10.1093/nar/gkac653.

DOI:10.1093/nar/gkac653

PMID:35924489

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9561269/

Abstract

Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A lion's share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (aka linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read datasets, they are algorithmically more challenging to use. Except for computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of such events without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population specific or may have a functional impact.
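The abstract states that barcode information lets the method detect non-reference sequences without a whole-genome assembly. A minimal sketch of that general linked-read idea follows: collect the barcodes of reads aligned near a candidate insertion site, then recruit unmapped reads that share those barcodes as input for a local assembly. This is an illustration only, not Novel-X's actual implementation; the `Read` record, the function names, and the 50 kb window are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Read:
    """Hypothetical linked-read record; chrom/pos are None when unmapped."""
    name: str
    barcode: str
    chrom: Optional[str]
    pos: Optional[int]


def barcodes_near(reads: List[Read], chrom: str, breakpoint: int,
                  window: int = 50_000) -> Set[str]:
    """Barcodes of reads aligned within `window` bp of a candidate breakpoint.

    The window size is an assumed value chosen for illustration.
    """
    return {
        r.barcode
        for r in reads
        if r.chrom == chrom
        and r.pos is not None
        and abs(r.pos - breakpoint) <= window
    }


def recruit_unmapped(reads: List[Read], barcodes: Set[str]) -> List[Read]:
    """Unmapped reads sharing a barcode with reads near the breakpoint.

    These are the reads a local assembler would receive in this sketch.
    """
    return [r for r in reads if r.chrom is None and r.barcode in barcodes]


# Toy data: only barcode BX01 has a read aligned near chr1:1,000,000.
reads = [
    Read("r1", "BX01", "chr1", 1_000_100),
    Read("r2", "BX01", None, None),
    Read("r3", "BX02", "chr1", 5_000_000),
    Read("r4", "BX02", None, None),
    Read("r5", "BX03", "chr2", 1_000_050),
]

local = barcodes_near(reads, "chr1", 1_000_000)
candidates = recruit_unmapped(reads, local)
print([r.name for r in candidates])
```

In this toy input only `r2` is recruited, because it is unmapped and shares a barcode with a read aligned inside the window. Restricting the assembly input this way is what keeps the problem local rather than genome-wide.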

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6449/9561269/c2461a4ce3e5/gkac653fig1.jpg
