Compressive genomics for protein databases.

Affiliation

Department of Computer Science, Tufts University, Medford, MA 02451, USA.

Publication information

Bioinformatics. 2013 Jul 1;29(13):i283-90. doi: 10.1093/bioinformatics/btt214.

Abstract

MOTIVATION

The exponential growth of protein sequence databases has increasingly made the fundamental task of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Accelerating programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search itself but also the large collection of other current programs that interact with large protein databases primarily through these tools.

RESULTS

We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
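
The scaling claim above can be pictured with a toy sketch. It is not CaBLASTP's algorithm: the abstract does not describe the compression format, so everything here is an assumption made for illustration, including the k-mer Jaccard score used in place of BLAST alignment, the whole-sequence grouping used in place of local similarity links, the threshold values, and the names `compress` and `search`. The one idea it keeps is that the query is compared against the unique representatives first, and only the members of matching groups are examined in full.

```python
from dataclasses import dataclass, field


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


@dataclass
class CompressedDB:
    # Hypothetical layout: one representative per group of similar sequences.
    representatives: list = field(default_factory=list)
    members: dict = field(default_factory=dict)  # group index -> [(name, seq)]


def compress(db, threshold=0.6):
    """Greedily group each sequence with the first similar representative."""
    cdb = CompressedDB()
    for name, seq in db:
        for idx, rep in enumerate(cdb.representatives):
            if similarity(seq, rep) >= threshold:
                cdb.members[idx].append((name, seq))
                break
        else:
            # No similar representative: this sequence counts as unique data.
            cdb.representatives.append(seq)
            cdb.members[len(cdb.representatives) - 1] = [(name, seq)]
    return cdb


def search(query, cdb, coarse=0.3, fine=0.5):
    """Coarse pass over representatives, fine pass over matching groups only."""
    hits = []
    for idx, rep in enumerate(cdb.representatives):
        if similarity(query, rep) >= coarse:  # permissive coarse filter
            for name, seq in cdb.members[idx]:
                score = similarity(query, seq)
                if score >= fine:  # stricter fine filter
                    hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])
```

In this sketch the coarse pass costs one comparison per representative, so its work grows with the number of unique groups rather than with the total number of database entries; the fine pass touches only sequences in groups that passed the coarse filter.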

AVAILABILITY

CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/

CONTACT

bab@mit.edu.
