Bonizzoni Paola, Vedova Gianluca Della, Pirola Yuri, Previtali Marco, Rizzi Raffaella
Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca , Milan, Italy .
J Comput Biol. 2016 Mar;23(3):137-49. doi: 10.1089/cmb.2015.0172.
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited to use an amount of main memory requirement that is independent from the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012 ), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on a 875-million read whole-genome dataset, on which LSG has built the string graph using only 1GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time than SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.
在未来的应用中,如宏基因组学或癌症基因组学,需要组装大量的短读长数据,这有力地推动了对基于磁盘的方法来索引下一代测序(NGS)数据的研究。在这个方向上取得的积极成果激发了对从NGS数据进行从头组装的高效外部存储算法的研究。我们的文章还受到一个开放问题的启发,即设计一种空间高效的算法,使用基于Burrows-Wheeler变换(BWT)的索引过程来计算字符串图。我们开发了一种在外部存储中计算字符串图的基于磁盘的算法:轻量级字符串图(LSG)。LSG依赖于FM索引的一种新表示形式,利用这种表示形式可以使主内存需求与数据集大小无关。此外,我们还开发了一个用于从NGS数据进行基因组组装的流程,该流程将LSG与SGA(Simpson和Durbin,2012)的组装步骤集成在一起,SGA是一种基于字符串图的先进组装器,并使用BEETL对输入数据进行索引。LSG是开源软件,可在线获取。我们在一个包含8.75亿条读段的全基因组数据集上分析了我们的实现,在该数据集上,LSG仅使用1GB主内存构建了字符串图(相对于SGA,内存占用减少了50倍),而所需时间仅比SGA多两倍多一点。对整个流程的分析表明,内存使用量显著减少,同时运行时间仅适度增加。