
iSeqSearch: incremental protein search for iBlast/iMMSeqs2/iDiamond.

Authors

Yoo Hyunwoo, Refahi Mohammadsaleh, Polikar Robi, Sokhansanj Bahrad A, Brown James R, Rosen Gail L

Affiliations

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States of America.

Electrical and Computer Engineering, Rowan University, Glassboro, NJ, United States of America.

Publication information

PeerJ. 2025 Apr 28;13:e19171. doi: 10.7717/peerj.19171. eCollection 2025.

DOI: 10.7717/peerj.19171
PMID: 40313391
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12045279/

Abstract

BACKGROUND

The advancement of sequencing technology has led to a rapid increase in the amount of DNA and protein sequence data; consequently, the size of genomic and proteomic databases is constantly growing. As a result, database searches need to be continually updated to account for the new data being added. However, continually re-searching the entire existing dataset wastes resources. Incremental database search can address this problem.

METHODS

One recently introduced incremental search method is iBlast, which wraps the BLAST sequence search method with an algorithm that reuses previously processed data and thereby increases search efficiency. The iBlast wrapper, however, must be generalized to support the better-performing DNA/protein sequence search methods that have since been developed, namely MMseqs2 and Diamond. To address this need, we propose iSeqSearch, which extends iBlast by incorporating support for MMseqs2 (iMMseqs2) and Diamond (iDiamond), thereby providing a more general and broadly effective incremental search framework. Moreover, the previously published iBlast wrapper has to be revised to be more robust and usable by the general community.
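The idea of reusing earlier results can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes, as a simplification, that an e-value scales linearly with the size of the searched database, so hits from the old database and from the newly added part can each be rescaled to the combined size and then merged. The names `Hit`, `rescale`, and `merge_incremental`, and the use of raw residue counts as "size", are hypothetical choices made for illustration; real tools use an effective search space with additional corrections.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    """One alignment hit (hypothetical record layout for this sketch)."""
    query: str
    subject: str
    bitscore: float
    evalue: float


def rescale(hits: Iterable[Hit], searched_size: int, total_size: int) -> List[Hit]:
    """Rescale e-values from a search over `searched_size` to `total_size`.

    Assumption: the e-value is proportional to database size, so the
    correction is a single multiplicative factor.
    """
    if searched_size <= 0 or total_size < searched_size:
        raise ValueError("sizes must satisfy 0 < searched_size <= total_size")
    factor = total_size / searched_size
    return [replace(h, evalue=h.evalue * factor) for h in hits]


def merge_incremental(
    old_hits: Iterable[Hit],
    old_size: int,
    new_hits: Iterable[Hit],
    new_size: int,
    max_evalue: float = 10.0,
) -> List[Hit]:
    """Merge hits from the old database with hits from the added sequences.

    Only the added sequences need to be searched; the old hits are reused
    after their e-values are corrected to the combined database size.
    """
    total = old_size + new_size
    merged = rescale(old_hits, old_size, total) + rescale(new_hits, new_size, total)
    kept = [h for h in merged if h.evalue <= max_evalue]
    # Best hits first within each query: low e-value, then high bit score.
    return sorted(kept, key=lambda h: (h.query, h.evalue, -h.bitscore))
```

In this sketch, a hit that passed the e-value cutoff in the smaller database can fall below the cutoff after rescaling, which is why the filter is applied after the merge rather than before it.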

RESULTS

iMMseqs2 and iDiamond, which apply the incremental approach, perform nearly identically to MMseqs2 and Diamond. Notably, when the rankings are compared using measures such as the Pearson correlation, we observe a high concordance of over 0.9, indicating similar results. Moreover, in some cases, our incremental approach, iSeqSearch, which extends the iBlast merge function to iMMseqs2 and iDiamond, returns more hits than the conventional MMseqs2 and Diamond methods.
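A rank comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the evaluation code used in the paper: each search is assumed to return an ordered list of subject identifiers for one query, and concordance is taken as the Pearson correlation of the rank positions of the subjects found by both searches. The function names `pearson` and `rank_concordance` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def rank_concordance(full_search: List[str], incremental: List[str]) -> float:
    """Correlate rank positions of subjects reported by both searches.

    Each argument is a best-first list of subject IDs for the same query.
    Subjects found by only one of the two searches are ignored here.
    """
    pos_full = {s: i for i, s in enumerate(full_search, start=1)}
    pos_inc = {s: i for i, s in enumerate(incremental, start=1)}
    shared = [s for s in full_search if s in pos_inc]
    return pearson([pos_full[s] for s in shared], [pos_inc[s] for s in shared])
```

A value close to 1 means the two searches order their shared hits almost identically. Ignoring hits unique to one search is a design choice of this sketch, so extra hits returned by the incremental method would have to be counted separately.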

CONCLUSION

The incremental approach using iMMseqs2 and iDiamond reuses previously processed data efficiently while maintaining high accuracy and concordance in search results. This method can reduce resource waste when searching continually growing genomic and proteomic databases. Sample code and data are available on GitHub and Zenodo (https://github.com/EESI/Incremental-Protein-Search; DOI: 10.5281/zenodo.14675319).


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d1c9/12045279/6b08b06bc3f2/peerj-13-19171-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d1c9/12045279/2f1c5ac93171/peerj-13-19171-g001.jpg
