Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt.
Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt.
Bioinformatics. 2016 Nov 1;32(21):3215-3223. doi: 10.1093/bioinformatics/btw470. Epub 2016 Jul 13.
The volume of sequencing data has outpaced Moore's Law, more than doubling every 2 years since next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data at high speed and fixed cost, but lack the computational resources to store, process and analyze it. With error-prone, high-throughput NGS reads and genomic repeats, the assembly graph contains a massive number of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory before their workflows can start, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine advanced computing techniques with innovative data structures to encode the assembly graph efficiently in computer memory.
LightAssembler is a lightweight assembly algorithm designed to run on a desktop machine. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of [Formula: see text]-spaced sequenced k-mers, and the other holds k-mers classified as likely correct by a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to competing tools. Our method reduces memory usage by [Formula: see text] compared to the resource-efficient assemblers on benchmark datasets from the GAGE and Assemblathon projects. Although LightAssembler can be considered a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage.
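The sampling step described above can be illustrated with a minimal sketch. It is not LightAssembler's implementation. It uses a plain (not cache-oblivious) Bloom filter, and it assumes that "gap-spaced" means skipping a fixed number of start positions between consecutive sampled k-mers in each read. The names `BloomFilter`, `gap_spaced_kmers`, `build_sample_filter` and the parameter `g` are hypothetical, and the statistical test that fills the second filter is not reproduced here.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not cache-oblivious)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest via double hashing.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def gap_spaced_kmers(read, k, g):
    """Yield k-mers of a read, skipping g start positions between samples.

    Assumption: sampled k-mers start at positions 0, g+1, 2(g+1), ...
    """
    for start in range(0, len(read) - k + 1, g + 1):
        yield read[start:start + k]


def build_sample_filter(reads, k, g, num_bits, num_hashes):
    """Insert the gap-spaced k-mer sample of every read into one filter."""
    sample = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in gap_spaced_kmers(read, k, g):
            sample.add(kmer)
    return sample


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "TTGACGTACG"]
    sample = build_sample_filter(reads, k=4, g=2, num_bits=1024, num_hashes=3)
    # Inserted k-mers are always reported present (no false negatives).
    print("ACGT" in sample)
```

Only a fraction of each read's k-mers is stored, which is where the memory saving of a sampling approach would come from. A Bloom filter can report false positives but never false negatives, so every sampled k-mer is guaranteed to be found again on lookup.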
https://github.com/SaraEl-Metwally/LightAssembler CONTACT: sarah_almetwally4@mans.edu.eg SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.