Strand-seq enables reliable separation of long reads by chromosome via expectation maximization.

Affiliations

Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.

Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.

Publication information

Bioinformatics. 2018 Jul 1;34(13):i115-i123. doi: 10.1093/bioinformatics/bty290.

Abstract

MOTIVATION

Current sequencing technologies are able to produce reads orders of magnitude longer than was previously possible. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.

RESULTS

To this end, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess the amount of uncertainty inherent to sparse Strand-seq data at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
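
To illustrate the general idea of soft clustering by Expectation Maximization, the following is a minimal Python sketch. It is not the SaaRclust model or its implementation: it assumes a simplified mixture of independent binomials, in which each long read carries per-cell counts of Watson- and Crick-oriented Strand-seq reads. The function name `em_cluster`, its parameters, and the pseudocounts are all hypothetical choices made for this example.

```python
import numpy as np

def em_cluster(w, c, n_clusters, n_iter=50, seed=0):
    """Toy EM for a binomial mixture (illustrative only, not SaaRclust).

    w, c : arrays of shape (n_reads, n_cells) holding Watson and Crick
           counts per long read and single-cell library (assumed input).
    Returns posterior cluster probabilities, mixing weights, and the
    per-cluster, per-cell Watson probabilities.
    """
    rng = np.random.default_rng(seed)
    n_reads, n_cells = w.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)               # mixing weights
    p = rng.uniform(0.25, 0.75, size=(n_clusters, n_cells))  # P(Watson)

    for _ in range(n_iter):
        # E-step: log-likelihood of each read under each cluster.
        # The binomial coefficient is constant across clusters and is omitted.
        log_lik = w @ np.log(p).T + c @ np.log1p(-p).T
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the soft assignments.
        pi = np.clip(post.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        # Pseudocounts (an assumption of this sketch) keep p strictly in (0, 1).
        p = (post.T @ w + 1.0) / (post.T @ (w + c) + 2.0)

    return post, pi, p
```

Each row of the returned `post` array is a probability distribution over clusters for one read, so a read can be reported as confidently assigned only when its largest posterior exceeds a chosen threshold. This mirrors, in a much simplified form, the per-read uncertainty assessment described above.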

AVAILABILITY AND IMPLEMENTATION

https://github.com/daewoooo/SaaRclust.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ddd/6022540/61175c623079/bty290f1.jpg
