LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France.
BMC Bioinformatics. 2011 Jun 17;12:242. doi: 10.1186/1471-2105-12-242.
High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome, and the k-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput urges for new algorithmic solutions to query large read collections efficiently.
Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that mine efficiently genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.
高通量测序(HTS)现在广泛用于基因组(重新)测序、宏基因组学、表观基因组学和转录组学,需要不同的、但计算机密集型的生物信息学分析。当有参考基因组时,将读取映射到其上是分析的第一步。读取映射程序的效率归功于使用涉及基因组索引数据结构,如 Burrows-Wheeler 变换。最近的解决方案使用哈希表索引基因组和读取的 k-mer 进一步提高效率和准确性。在各种情况下(例如组装或转录组分析),读取处理需要确定与给定序列相关的读取子集,这是通过在读取中搜索某些 k-mer 来完成的。目前,许多开发工作都集中在用于读取映射的基因组索引结构上,但读取索引的问题仍然广泛未被探索。然而,序列通量的增加迫切需要新的算法解决方案来有效地查询大型读取集合。
在这里,我们提出了一种名为 Gk 数组的解决方案,用于索引大型读取集合,一种构建结构的算法和查询它的过程。一旦构建,索引结构就保留在主内存中,并重复访问以回答类似“给定一个 k-mer,获取包含此 k-mer 的读取(一次/至少一次)”的查询。我们将我们的结构与其他解决方案进行了比较,这些解决方案适应了用于长文本的未压缩索引结构,并表明它快速处理查询,同时需要更少的内存。因此,我们的结构可以处理更大的读取集合。我们提供了一些示例,其中这些查询适应于不同类型的读取分析(SNP 检测、组装、RNA-Seq)。
Gk 数组是一种通用的数据结构,可在各种情况下实现快速而更准确的读取分析。Gk 数组为设计高效挖掘基因组学、表观基因组学、宏基因组学或转录组学读取的创新程序提供了一个灵活的构建块。Gk 数组库可从 http://www.atgc-montpellier.fr/ngs/ 根据 Cecill(符合 GPL)许可证获得。