The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.

Authors

Almutairy Meznah, Torng Eric

Affiliation

Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.

Publication

PLoS One. 2017 Jul 7;12(7):e0179046. doi: 10.1371/journal.pone.0179046. eCollection 2017.

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling, in which only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
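The trade-off between hard and soft sampling can be illustrated with a small sketch. It assumes fixed-interval sampling (keeping every `step`-th k-mer start position), which is only one possible sampling scheme and is not necessarily the one used in the paper. The function names `build_index`, `find_seeds`, and `stored_occurrences` are hypothetical and do not correspond to BLAST internals. Under this assumption, an exact match of length L contains L - k + 1 consecutive k-mer start positions, so any `step` of at most L - k + 1 guarantees that at least one stored k-mer falls inside the match (hard sampling); a larger `step` stores fewer occurrences but may miss the match (soft sampling).

```python
from collections import defaultdict


def build_index(db, k, step=1):
    """Map each k-mer to the database start positions kept after sampling.

    Assumption: fixed-interval sampling, i.e. only positions 0, step, 2*step, ...
    are stored. step=1 stores every occurrence (no sampling).
    """
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, step):
        index[db[pos:pos + k]].append(pos)
    return index


def find_seeds(index, query, k):
    """Return (query_pos, db_pos) pairs where a query k-mer hits a stored occurrence.

    Every k-mer of the query is looked up; only the database side is sampled.
    """
    seeds = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, dpos))
    return seeds


def stored_occurrences(index):
    """Count the k-mer occurrences held in the index (a proxy for index size)."""
    return sum(len(positions) for positions in index.values())


if __name__ == "__main__":
    db = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGA"  # toy database, not real genomic data
    k = 4
    query = db[5:17]                         # exact match of length L = 12 at offset 5
    hard_limit = len(query) - k + 1          # largest step that still guarantees a hit

    for step in (1, 3, hard_limit, 16):
        index = build_index(db, k, step)
        on_diagonal = [s for s in find_seeds(index, query, k) if s[1] - s[0] == 5]
        kind = "hard" if step <= hard_limit else "soft"
        print(step, kind, stored_occurrences(index), bool(on_diagonal))
```

A seed whose database position minus query position equals the true offset lies on the diagonal of the real match, which is what a seed-and-extend search would then try to extend. In this sketch the index size shrinks roughly in proportion to `step`, while the guarantee of finding the match is lost once `step` exceeds L - k + 1.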
