MAP：搜索大型基因组数据库。

MAP: searching large genome databases.

作者信息

Kahveci Tamer, Singh Ambuj

机构信息

Department of Computer Science, University of California, Santa Barbara, CA 93106, USA.

出版信息

Pac Symp Biocomput. 2003:303-14.

PMID:12603037

Abstract

A number of biological applications require comparison of large genome strings. Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets. We propose an efficient technique for alignment of large genome strings. Our technique precomputes the associations between the database strings and the query string. These associations are used to prune the database-query substring pairs that do not contain similar regions. We use a hash table to compare the unpruned regions of the query and database strings. The cost of the ensuing search is determined by how the hash table is constructed. We present a dynamic strategy that optimizes the random disk I/O needed for accessing the hash table. It also provides the user a coarse grain visualization of the similarity pattern quickly before the actual search. The experimental results show that our technique aligns genome strings up to 97 times faster than BLAST.

摘要

许多生物学应用需要对大型基因组字符串进行比较。由于内存需求大且候选集庞大，当前技术存在磁盘I/O和计算成本问题。我们提出了一种用于大型基因组字符串比对的高效技术。我们的技术预先计算数据库字符串与查询字符串之间的关联。这些关联用于修剪不包含相似区域的数据库-查询子字符串对。我们使用哈希表来比较查询字符串和数据库字符串的未修剪区域。后续搜索的成本取决于哈希表的构建方式。我们提出了一种动态策略，可优化访问哈希表所需的随机磁盘I/O。它还能在实际搜索之前快速为用户提供相似性模式的粗粒度可视化。实验结果表明，我们的技术比对基因组字符串的速度比BLAST快97倍。