Department of Computer Science, Tufts University, Medford, MA 02451, USA.
Bioinformatics. 2013 Jul 1;29(13):i283-90. doi: 10.1093/bioinformatics/btt214.
The exponential growth of protein sequence databases has increasingly made the fundamental task of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Accelerating programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search itself but also the large collection of existing programs that interact with large protein databases primarily through these tools.
We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented so that they can be substituted directly into existing analysis pipelines. The key idea is a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, whereas current BLASTP variants scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks that rely heavily on homology search, such as protein structure prediction and orthology mapping.
CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/
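The compress-then-search idea described above can be illustrated with a toy sketch. This is not CaBLASTP's implementation: it assumes equal-length sequences, uses plain positional identity as a stand-in for BLAST alignment scoring, and all names and thresholds (`compress`, `search`, `coarse`, `fine`) are hypothetical. It only shows why search cost can track the number of unique representatives and not the full database size: near-duplicates are stored as differences against a representative, the coarse pass scans representatives only, and linked sequences are reconstructed solely for representatives that pass.

```python
from dataclasses import dataclass, field


def identity(a, b):
    """Fraction of matching positions (toy stand-in for an alignment score)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


@dataclass
class CompressedDB:
    reps: list = field(default_factory=list)   # unique data: (name, sequence)
    links: dict = field(default_factory=dict)  # rep index -> [(name, diffs)]


def compress(db, threshold=0.8):
    """Store each sequence as a new representative or as diffs against one."""
    cdb = CompressedDB()
    for name, seq in db.items():
        for i, (_, rep) in enumerate(cdb.reps):
            if identity(seq, rep) >= threshold:
                # Keep only the positions where seq differs from the representative.
                diffs = [(p, c) for p, (c, r) in enumerate(zip(seq, rep)) if c != r]
                cdb.links[i].append((name, diffs))
                break
        else:
            # No similar representative found: this sequence is new unique data.
            cdb.links[len(cdb.reps)] = []
            cdb.reps.append((name, seq))
    return cdb


def reconstruct(rep, diffs):
    """Rebuild an original sequence from its representative and stored diffs."""
    chars = list(rep)
    for pos, residue in diffs:
        chars[pos] = residue
    return "".join(chars)


def search(query, cdb, coarse=0.6, fine=0.8):
    """Coarse pass over representatives only, then fine pass on linked sequences."""
    hits = []
    for i, (rep_name, rep) in enumerate(cdb.reps):
        if identity(query, rep) < coarse:
            continue  # everything linked to this representative is skipped
        candidates = [(rep_name, rep)]
        candidates += [(n, reconstruct(rep, d)) for n, d in cdb.links[i]]
        hits += [n for n, s in candidates if identity(query, s) >= fine]
    return hits


# Hypothetical toy database: p2 and p3 are near-duplicates of p1, p4 is unrelated.
toy_db = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "MKTAYLAKQR",
    "p4": "GSHMLEDPVW",
}
toy_cdb = compress(toy_db)
toy_hits = search("MKTAYIAKQK", toy_cdb)
```

In this sketch the coarse threshold is deliberately looser than the fine one, so that a representative can admit linked sequences that match the query better than the representative itself does.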