Querying large read collections in main memory: a versatile data structure.

Affiliation

LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France.

Publication information

BMC Bioinformatics. 2011 Jun 17;12:242. doi: 10.1186/1471-2105-12-242.

DOI:10.1186/1471-2105-12-242

PMID:21682852

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3163563/

Abstract

BACKGROUND

High-throughput sequencing (HTS) is now heavily exploited for genome (re)sequencing, metagenomics, epigenomics, and transcriptomics, and requires diverse but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads onto it is the first step of the analysis. Read mapping programs owe their efficiency to sophisticated genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads, using hash tables, to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads related to a given sequence, which is done by searching for some k-mers in the reads. To date, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. However, the increase in sequencing throughput calls for new algorithmic solutions to query large read collections efficiently.

RESULTS

Here, we present a solution, named Gk arrays, for indexing large collections of reads, together with an algorithm to build the structure and procedures to query it. Once constructed, the index is kept in main memory and accessed repeatedly to answer queries such as "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure with other solutions that adapt uncompressed indexing structures designed for long texts, and show that it answers queries quickly while requiring much less memory. Our structure can thus handle larger read collections. We provide examples in which such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
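To make the query type concrete, the following is a minimal Python sketch of a naive in-memory k-mer index built on a hash table. It is not the Gk arrays structure and makes no attempt at its memory efficiency; it only illustrates the semantics of "reads containing this k-mer (once/at least once)". The function names `build_kmer_index` and `reads_containing` and the toy reads are assumptions made for illustration and do not come from the Gk arrays library.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to {read_id: number of occurrences in that read}.

    Hypothetical helper: a plain hash-table baseline, NOT the Gk arrays.
    """
    index = defaultdict(lambda: defaultdict(int))
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            index[read[i:i + k]][read_id] += 1
    return index


def reads_containing(index, kmer, exactly_once=False):
    """Return sorted ids of reads containing `kmer`.

    With exactly_once=True, keep only reads where the k-mer occurs once;
    otherwise return every read where it occurs at least once.
    """
    counts = index.get(kmer, {})  # .get() does not create a missing key
    return sorted(
        read_id for read_id, n in counts.items()
        if not exactly_once or n == 1
    )


# Toy data, invented for this example.
reads = ["ACGTACGT", "TTACGA", "GGGG"]
index = build_kmer_index(reads, 3)

at_least_once = reads_containing(index, "ACG")
only_once = reads_containing(index, "ACG", exactly_once=True)
```

In this toy collection, "ACG" occurs twice in the first read and once in the second, so the two query variants return different read sets. A dictionary of dictionaries like this stores every k-mer string explicitly, which is the kind of memory cost a dedicated read index aims to avoid.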

CONCLUSIONS

Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. They provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
