Code optimization of the subroutine to remove near identical matches in the sequence database homology search tool PSI-BLAST.

Author information

Aspnäs Mats, Mattila Kimmo, Osowski Kristoffer, Westerholm Jan

Affiliation

Department of Information Technologies, Åbo Akademi University, Åbo, Finland.

Publication information

J Comput Biol. 2010 Jun;17(6):819-23. doi: 10.1089/cmb.2008.0053.

DOI:10.1089/cmb.2008.0053

PMID:20583927

Abstract

A central task in protein sequence characterization is using a sequence database homology search tool to find similar protein sequences in other individuals or species. PSI-BLAST is a widely used module of the BLAST package that calculates a position-specific score matrix from the best-matching sequences and performs iterated searches, using a procedure that keeps many near-identical sequences from contributing to the scores. For some queries and parameter settings, PSI-BLAST may find many similar high-scoring matches, and up to 80% of the total run time may then be spent in this procedure. In this article, we present code optimizations that improve the cache utilization and the overall performance of this procedure. Measurements show that, for queries where the number of similar matches is high, the optimized PSI-BLAST program may be as much as 2.9 times faster than the original program.
