Fast and memory efficient approach for mapping NGS reads to a reference genome.

Author information

Kumar Sanjeev, Agarwal Suneeta

Affiliation

1 CSED, NIT Allahabad, 211004, India.

Publication information

J Bioinform Comput Biol. 2019 Apr;17(2):1950008. doi: 10.1142/S0219720019500082.

Abstract

Next-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing (NGS) data analysis, for example in genetic variation studies and genome re-sequencing. There is therefore a need for a new approach that aligns this enormous number of reads to the reference genome efficiently in both memory and time. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require large amounts of memory for whole-reference-genome indexing and read alignment. The gapped-alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach, WIT, is proposed for reference genome indexing and read alignment using the Burrows-Wheeler Transform (BWT) and the Wavelet Tree (WT). It supports both exact and approximate alignment. Experiments show that WIT performs best in the case of protein sequence indexing. For indexing, WIT requires 0.6 N space (where N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experiments also show that, even with such a small index, the alignment time of the proposed approach is comparable to that of BWA, Subread, Kart, and Minimap2. The other alignment parameters, accuracy and confidentiality, are also shown experimentally to be better than those of Minimap2. The source code of WIT is available at http://www.algorithm-skg.com/wit/home.html .
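To make the BWT-plus-wavelet-tree idea concrete, here is a minimal Python sketch of exact read matching by backward search. It is not the authors' WIT implementation: the names `build_bwt`, `WaveletTree`, and `TinyIndex` are hypothetical, the suffix array is built by naive sorting, and each wavelet-tree node stores plain prefix-count lists instead of succinct bitvectors. It therefore illustrates the lookup logic only and does not reproduce the 0.6 N index size or the approximate-alignment mode described in the abstract.

```python
def build_bwt(reference):
    """Return (BWT, suffix array) of reference + '$' via naive suffix sorting."""
    s = reference + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is '$' when i == 0
    return bwt, sa


class WaveletTree:
    """Toy wavelet tree that answers rank(c, i) over a sequence."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: every symbol here is the same
        mid = len(self.alphabet) // 2
        self.left_set = set(self.alphabet[:mid])
        # prefix[i] = how many symbols of seq[:i] go to the right child
        self.prefix = [0]
        for c in seq:
            self.prefix.append(self.prefix[-1] + (c not in self.left_set))
        self.left = WaveletTree(
            [c for c in seq if c in self.left_set], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in self.left_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c in the first i symbols."""
        node = self
        while node.left is not None:
            ones = node.prefix[i]
            if c in node.left_set:
                i -= ones  # position within the left child's sequence
                node = node.left
            else:
                i = ones   # position within the right child's sequence
                node = node.right
        return i if node.alphabet == [c] else 0


class TinyIndex:
    """Hypothetical FM-index-style wrapper: BWT + wavelet tree + C table."""

    def __init__(self, reference):
        self.bwt, self.sa = build_bwt(reference)
        self.wt = WaveletTree(self.bwt)
        # C[ch] = number of symbols in the BWT strictly smaller than ch
        self.C, total = {}, 0
        for ch in sorted(set(self.bwt)):
            self.C[ch] = total
            total += self.bwt.count(ch)

    def locate(self, read):
        """Return sorted start positions of exact matches of read."""
        lo, hi = 0, len(self.bwt)
        for ch in reversed(read):  # backward search, last symbol first
            if ch not in self.C:
                return []
            lo = self.C[ch] + self.wt.rank(ch, lo)
            hi = self.C[ch] + self.wt.rank(ch, hi)
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])


index = TinyIndex("ACGTACGTTACG")
print(index.locate("ACG"))  # [0, 4, 9]
print(index.locate("GGG"))  # []
```

Each step of the backward search needs only two rank queries, which is why a compact rank structure such as a wavelet tree over the BWT can stand in for a much larger index. A space-efficient implementation would replace the Python lists with compressed bitvectors and sample the suffix array rather than store it in full.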
