Yoo Hyunwoo, Refahi Mohammadsaleh, Polikar Robi, Sokhansanj Bahrad A, Brown James R, Rosen Gail L
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States of America.
Electrical and Computer Engineering, Rowan University, Glassboro, NJ, United States of America.
PeerJ. 2025 Apr 28;13:e19171. doi: 10.7717/peerj.19171. eCollection 2025.
The advancement of sequencing technology has led to a rapid increase in the amount of DNA and protein sequence data; consequently, the size of genomic and proteomic databases is constantly growing. As a result, database searches need to be continually updated to account for the new data being added. However, continually re-searching the entire existing dataset wastes resources. Incremental database search can address this problem.
One recently introduced incremental search method is iBlast, which wraps the BLAST sequence search method with an algorithm that reuses previously processed data and thereby increases search efficiency. The iBlast wrapper, however, must be generalized to support the better-performing DNA/protein sequence search methods that have since been developed, namely MMseqs2 and Diamond. To address this need, we propose iSeqsSearch, which extends iBlast by incorporating support for MMseqs2 (iMMseqs2) and Diamond (iDiamond), thereby providing a more general and broadly effective incremental search framework. In addition, the previously published iBlast wrapper needs to be revised to make it more robust and usable by the general community.
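The idea of reusing earlier results can be illustrated with a minimal sketch. It is not the iSeqsSearch implementation: the `Hit` record, the function names, and the `top_k` cutoff are hypothetical, and it assumes that an E-value scales roughly linearly with database size, so hits from the old database and from the newly added sequences are rescaled to the combined size before merging.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hit:
    """Hypothetical hit record; field names are illustrative only."""
    query: str
    target: str
    bitscore: float
    evalue: float  # E-value relative to the database part that was searched


def rescale(hits, part_size, total_size):
    """Rescale E-values from one database part to the combined database.

    Assumption: E-values grow linearly with database size. Real tools use an
    effective search space, so this is only an approximation.
    """
    factor = total_size / part_size
    return [replace(h, evalue=h.evalue * factor) for h in hits]


def merge_incremental(old_hits, old_size, new_hits, new_size, top_k=5):
    """Merge stored hits with hits from searching only the added sequences."""
    total = old_size + new_size
    combined = rescale(old_hits, old_size, total) + rescale(new_hits, new_size, total)

    # Group by query, then keep the best top_k hits per query
    # (lowest E-value first, ties broken by higher bit score).
    per_query = {}
    for hit in combined:
        per_query.setdefault(hit.query, []).append(hit)
    return {
        query: sorted(hits, key=lambda h: (h.evalue, -h.bitscore))[:top_k]
        for query, hits in per_query.items()
    }


# Usage: the old database is not searched again; only its stored hits are reused.
old = [Hit("q1", "A", 100.0, 1e-10), Hit("q1", "B", 50.0, 1e-3)]
new = [Hit("q1", "C", 80.0, 1e-8)]
merged = merge_incremental(old, 1000, new, 1000, top_k=2)
```

In this sketch only the added sequences are searched at update time, which is where the savings over re-searching the whole database would come from.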
iMMseqs2 and iDiamond, which apply the incremental approach, perform nearly identically to MMseqs2 and Diamond. Notably, rank-comparison measures such as the Pearson correlation show a concordance above 0.9 between the incremental and conventional results, indicating that the two produce similar hit rankings. Moreover, in some cases iSeqsSearch, which extends the iBlast merge function to iMMseqs2 and iDiamond, reports more hits than the conventional MMseqs2 and Diamond searches.
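One way such a rank concordance could be computed is sketched below. The abstract does not specify the exact procedure, so this is an assumption: the Pearson correlation is taken over the rank positions of targets that appear in both the conventional and the incremental hit list for a query. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rank_concordance(reference, incremental):
    """Correlate the rank positions of targets found in both hit lists.

    Each argument is a list of target IDs ordered from best to worst hit.
    Targets present in only one list are ignored (an assumption of this sketch).
    """
    ref_rank = {target: i for i, target in enumerate(reference)}
    inc_rank = {target: i for i, target in enumerate(incremental)}
    shared = [t for t in reference if t in inc_rank]
    return pearson([ref_rank[t] for t in shared], [inc_rank[t] for t in shared])


# Usage: two neighbouring hits swap places and the incremental search reports one extra hit.
r = rank_concordance(["A", "B", "C", "D"], ["A", "C", "B", "D", "E"])
```

Because targets found by only one method are dropped here, a value near 1 says nothing about extra hits; those would need to be counted separately.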
The incremental approach using iMMseqs2 and iDiamond efficiently reuses previously processed data while maintaining high accuracy and concordance in search results. This method can reduce resource waste when searching continually growing genomic and proteomic databases. Sample code and data are available on GitHub and Zenodo (https://github.com/EESI/Incremental-Protein-Search; DOI: 10.5281/zenodo.14675319).