Compressive genomics for protein databases.

Author information

Department of Computer Science, Tufts University, Medford, MA 02451, USA.

Publication information

Bioinformatics. 2013 Jul 1;29(13):i283-90. doi: 10.1093/bioinformatics/btt214.

DOI:10.1093/bioinformatics/btt214

PMID:23812995

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3851851/

Abstract

MOTIVATION

The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools.

RESULTS

We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate to all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
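One way to picture why runtime can track the amount of unique data is a two-stage search over a redundancy-reduced database. The toy sketch below is not CaBLASTP's implementation: it assumes whole-sequence clustering in place of the local similarity-based compression the abstract describes, and a k-mer Jaccard score as a crude stand-in for a BLAST alignment score. All function names, thresholds, and sequences are hypothetical.

```python
K = 3  # k-mer length; an illustrative choice, not taken from the paper


def kmers(seq, k=K):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b):
    """Jaccard similarity of k-mer sets (stand-in for an alignment score)."""
    ka, kb = kmers(a), kmers(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def compress(database, link_threshold=0.6):
    """Greedily group sequences: each one joins the first representative it
    resembles, or becomes a new representative (the 'unique' data)."""
    clusters = {}  # representative id -> ids of all members, itself included
    for seq_id, seq in database.items():
        for rep_id in clusters:
            if similarity(seq, database[rep_id]) >= link_threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]
    return clusters


def search(query, database, clusters, coarse_threshold=0.3, fine_threshold=0.5):
    """Coarse pass over representatives only, then a fine pass over the
    members of the clusters that survived the coarse pass."""
    # Coarse stage: cost grows with the number of representatives,
    # not with the size of the full database.
    candidates = [rep for rep in clusters
                  if similarity(query, database[rep]) >= coarse_threshold]
    # Fine stage: rescore every member of each candidate cluster.
    hits = []
    for rep in candidates:
        for member in clusters[rep]:
            score = similarity(query, database[member])
            if score >= fine_threshold:
                hits.append((member, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Made-up sequences: two near-duplicate pairs.
toy_db = {
    "a1": "MKTAYIAKQR",
    "a2": "MKTAYIAKQK",
    "b1": "GSHMLEDPVW",
    "b2": "GSHMLEDPVF",
}
toy_clusters = compress(toy_db)
print(toy_clusters)
print(search("MKTAYIAKQR", toy_db, toy_clusters))
```

In this sketch the coarse threshold is deliberately more permissive than the fine one, so that a representative that only loosely matches the query can still pull its cluster members into the fine stage.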

AVAILABILITY

CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/

CONTACT

bab@mit.edu.

Similar articles

Compressive genomics for protein databases.

Bioinformatics. 2013 Jul 1;29(13):i283-90. doi: 10.1093/bioinformatics/btt214.

Comparing compressed sequences for faster nucleotide BLAST searches.

IEEE/ACM Trans Comput Biol Bioinform. 2007 Jul-Sep;4(3):349-64. doi: 10.1109/TCBB.2007.1029.

Fast batch searching for protein homology based on compression and clustering.

BMC Bioinformatics. 2017 Nov 21;18(1):508. doi: 10.1186/s12859-017-1938-8.

Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases.

Bioinformatics. 2000 Nov;16(11):988-1002. doi: 10.1093/bioinformatics/16.11.988.

Protein sequence-similarity search acceleration using a heuristic algorithm with a sensitive matrix.

J Struct Funct Genomics. 2016 Dec;17(4):147-154. doi: 10.1007/s10969-016-9210-4. Epub 2017 Jan 12.

Database similarity searches.

Methods Mol Biol. 2008;484:361-78. doi: 10.1007/978-1-59745-398-1_24.

Domain enhanced lookup time accelerated BLAST.

Biol Direct. 2012 Apr 17;7:12. doi: 10.1186/1745-6150-7-12.

Sequence Similarity Searching.

Curr Protoc Protein Sci. 2019 Feb;95(1):e71. doi: 10.1002/cpps.71. Epub 2018 Aug 13.

G-BLASTN: accelerating nucleotide alignment by graphics processors.

Bioinformatics. 2014 May 15;30(10):1384-91. doi: 10.1093/bioinformatics/btu047. Epub 2014 Jan 24.

A performance enhanced PSI-BLAST based on hybrid alignment.

Bioinformatics. 2011 Jan 1;27(1):31-7. doi: 10.1093/bioinformatics/btq621. Epub 2010 Nov 24.

Cited by

Efficient and robust search of microbial genomes via phylogenetic compression.

Nat Methods. 2025 Apr;22(4):692-697. doi: 10.1038/s41592-025-02625-2. Epub 2025 Apr 9.

Image-centric compression of protein structures improves space savings.

BMC Bioinformatics. 2023 Nov 21;24(1):437. doi: 10.1186/s12859-023-05570-z.

Levenshtein Distance, Sequence Comparison and Biological Database Search.

IEEE Trans Inf Theory. 2021 Jun;67(6):3287-3294. doi: 10.1109/tit.2020.2996543. Epub 2020 May 21.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.

Single Cell Genomics Reveals Viruses Consumed by Marine Protists.

Front Microbiol. 2020 Sep 24;11:524828. doi: 10.3389/fmicb.2020.524828. eCollection 2020.

HFSP: high speed homology-driven function annotation of proteins.

Bioinformatics. 2018 Jul 1;34(13):i304-i312. doi: 10.1093/bioinformatics/bty262.

Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.

Cell Syst. 2018 Aug 22;7(2):201-207.e4. doi: 10.1016/j.cels.2018.05.021. Epub 2018 Jun 20.

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

J Comput Biol. 2018 Jul;25(7):755-765. doi: 10.1089/cmb.2017.0265. Epub 2018 Mar 12.

Fast batch searching for protein homology based on compression and clustering.

BMC Bioinformatics. 2017 Nov 21;18(1):508. doi: 10.1186/s12859-017-1938-8.

Computational Biology in the 21st Century: Scaling with Compressive Algorithms.

Commun ACM. 2016 Aug;59(8):72-80. doi: 10.1145/2957324.
