Strand-seq enables reliable separation of long reads by chromosome via expectation maximization.

Author information

Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.

Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.

Publication information

Bioinformatics. 2018 Jul 1;34(13):i115-i123. doi: 10.1093/bioinformatics/bty290.

DOI:10.1093/bioinformatics/bty290

PMID:29949971

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6022540/

Abstract

MOTIVATION

Current sequencing technologies are able to produce reads orders of magnitude longer than was ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.

RESULTS

We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows the uncertainty inherent to sparse Strand-seq data to be assessed at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
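To make the idea of a per-read posterior concrete, the following is a minimal sketch of EM soft clustering on strand counts. It is not the SaaRclust model itself: it assumes a plain mixture of binomials in which each cluster (standing in for a chromosome and read direction pair) has one Watson-strand fraction per single-cell library. The function name `em_strand_mixture`, the count matrices, the pseudocounts and the deterministic seeding are all illustrative assumptions.

```python
import math


def em_strand_mixture(w, c, k, iters=25):
    """Soft-cluster long reads from per-cell Watson/Crick counts with EM.

    w[i][j] and c[i][j] are the Watson and Crick short-read counts that
    support long read i in single-cell library j (hypothetical inputs).
    Returns (post, pi, p), where post[i][a] is the posterior probability
    that read i belongs to cluster a.
    """
    n, m = len(w), len(w[0])
    pi = [1.0 / k] * k  # mixing weights
    # Deterministic initialisation (an assumption of this sketch): seed each
    # cluster with the smoothed Watson fraction of an evenly spaced read.
    seeds = [(a * n) // k for a in range(k)]
    p = [[(w[s][j] + 0.5) / (w[s][j] + c[s][j] + 1.0) for j in range(m)]
         for s in seeds]
    post = []
    for _ in range(iters):
        # E-step: posterior over clusters for every read, in log space.
        post = []
        for i in range(n):
            logs = []
            for a in range(k):
                ll = math.log(pi[a])
                for j in range(m):
                    # Binomial coefficient omitted: identical for all clusters.
                    ll += w[i][j] * math.log(p[a][j])
                    ll += c[i][j] * math.log(1.0 - p[a][j])
                logs.append(ll)
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            post.append([v / total for v in weights])
        # M-step: re-estimate mixing weights and per-cell Watson fractions.
        for a in range(k):
            resp = sum(post[i][a] for i in range(n))
            pi[a] = (resp + 1e-9) / (n + k * 1e-9)
            for j in range(m):
                num = sum(post[i][a] * w[i][j] for i in range(n))
                den = sum(post[i][a] * (w[i][j] + c[i][j]) for i in range(n))
                p[a][j] = (num + 0.5) / (den + 1.0)  # pseudocounts keep p in (0, 1)
    return post, pi, p


# Toy data (made up): 8 reads, 3 cells; reads 0-3 and 4-7 follow
# different strand patterns across the cells.
w_counts = [[6, 0, 3], [5, 0, 2], [7, 1, 3], [4, 0, 4],
            [0, 6, 6], [0, 5, 5], [1, 6, 4], [0, 4, 6]]
c_counts = [[0, 6, 3], [0, 4, 3], [0, 6, 2], [1, 5, 3],
            [6, 0, 0], [5, 0, 1], [6, 0, 0], [5, 1, 0]]
post, pi, p = em_strand_mixture(w_counts, c_counts, k=2)
for i, row in enumerate(post):
    print(i, [round(v, 3) for v in row])
```

Each row of `post` is a probability distribution over clusters, so a read can be called confidently assigned when its largest posterior exceeds a chosen threshold and left unassigned otherwise. The threshold and the number of clusters are choices for the user of such a sketch; the published tool should be consulted for the actual model and defaults.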

AVAILABILITY AND IMPLEMENTATION

https://github.com/daewoooo/SaaRclust.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ddd/6022540/61175c623079/bty290f1.jpg
