Querying large read collections in main memory: a versatile data structure.

Affiliation

LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France.

Publication information

BMC Bioinformatics. 2011 Jun 17;12:242. doi: 10.1186/1471-2105-12-242.

DOI: 10.1186/1471-2105-12-242
PMID: 21682852
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3163563/

Abstract

BACKGROUND

High-throughput sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads to it is the first step of this analysis. Read mapping programs owe their efficiency to the use of sophisticated genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. So far, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. However, the increase in sequencing throughput calls for new algorithmic solutions to query large read collections efficiently.

RESULTS

Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
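
To make the kind of query described above concrete, here is a minimal Python sketch of a k-mer-to-reads index. It is not the Gk arrays structure and makes no attempt at its memory efficiency: it is a plain hash-table inverted index, used only to illustrate what "get the reads containing this k-mer (once/at least once)" could mean. The function names (`build_kmer_index`, `reads_with_kmer`, `reads_with_kmer_once`, `occurrence_count`) and the toy reads are hypothetical, and reading "once" as "exactly one occurrence in the read" is an assumption.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to {read_id: number of occurrences in that read}.

    Illustrative hash-based index only; this is NOT the Gk arrays layout.
    """
    index = defaultdict(dict)
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            counts = index[kmer]
            counts[read_id] = counts.get(read_id, 0) + 1
    return dict(index)


def reads_with_kmer(index, kmer):
    """Ids of reads containing the k-mer at least once, in increasing order."""
    return sorted(index.get(kmer, {}))


def reads_with_kmer_once(index, kmer):
    """Ids of reads in which the k-mer occurs exactly once (assumed meaning of 'once')."""
    return sorted(rid for rid, n in index.get(kmer, {}).items() if n == 1)


def occurrence_count(index, kmer):
    """Total number of occurrences of the k-mer across all reads."""
    return sum(index.get(kmer, {}).values())


# Toy read collection (made-up sequences, for illustration only).
reads = ["ACGTACGT", "TTACGA", "GGGG"]
index = build_kmer_index(reads, k=3)

print(reads_with_kmer(index, "ACG"))       # reads containing ACG at least once
print(reads_with_kmer_once(index, "ACG"))  # reads containing ACG exactly once
print(occurrence_count(index, "ACG"))      # total occurrences of ACG
```

In this sketch the index is built once and then queried repeatedly, mirroring the usage pattern the abstract describes. A dictionary of dictionaries is easy to read but stores every k-mer string explicitly, which is the kind of memory cost a dedicated structure aims to avoid.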

CONCLUSIONS

Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. They provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compatible) license from http://www.atgc-montpellier.fr/ngs/.
