The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.

Authors

Almutairy Meznah, Torng Eric

Affiliation

Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.

Publication information

PLoS One. 2017 Jul 7;12(7):e0179046. doi: 10.1371/journal.pone.0179046. eCollection 2017.

DOI:10.1371/journal.pone.0179046

PMID:28686614

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5501444/

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
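The difference between hard and soft sampling can be illustrated with a toy index. The sketch below is not the authors' BLAST-based implementation. It assumes fixed sampling (keep only the k-mer occurrences whose start position is a multiple of a step `w`) and simplifies "similar sequence" to an exact match of length `L`. Under those assumptions, a match contains `L - k + 1` consecutive k-mer start positions, so any `w <= L - k + 1` is guaranteed to retain at least one of them (hard sampling), while a larger `w` stores fewer occurrences but can miss the match (soft sampling). All function names are hypothetical.

```python
from collections import defaultdict


def build_sampled_index(text, k, step):
    """Map each k-mer to the start positions kept under fixed sampling.

    Only occurrences starting at positions 0, step, 2*step, ... are stored;
    step=1 gives the full, unsampled index.
    """
    index = defaultdict(list)
    for pos in range(0, len(text) - k + 1, step):
        index[text[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index, query, k):
    """Look up every k-mer of the query; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def stored_occurrences(index):
    """Total number of k-mer occurrences stored (a proxy for index size)."""
    return sum(len(positions) for positions in index.values())


if __name__ == "__main__":
    db = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCG"  # toy database sequence
    k = 4
    query = db[5:15]  # exact match of length L = 10, so L - k + 1 = 7

    for step in (1, 3, 7, 12):
        index = build_sampled_index(db, k, step)
        # A hit on diagonal 5 (db_pos - query_pos) means the true match was seeded.
        seeded = any(d - q == 5 for q, d in find_seed_hits(index, query, k))
        kind = "hard" if step <= len(query) - k + 1 else "soft"
        print(step, kind, stored_occurrences(index), seeded)
```

In this toy setting the stored occurrence count shrinks roughly in proportion to `w`, which is the space saving the abstract describes. The retention rates reported in the abstract concern inexact local alignments found with BLAST, which this exact-match sketch does not model.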
