Benchmark on Indexing Algorithms for Accelerating Molecular Similarity Search.

Publication information

J Chem Inf Model. 2020 Dec 28;60(12):6167-6184. doi: 10.1021/acs.jcim.0c00393. Epub 2020 Oct 23.

Abstract

Molecular similarity search can rapidly retrieve structurally similar analogues of a given query compound from chemical databases. However, the computational cost of an exhaustive similarity search over a large compound database is high. Although the latest indexing algorithms can greatly speed up the search, they are not readily applicable to molecular similarity search because they lack an implementation of the Tanimoto similarity metric. In this paper, we first implement Python or C++ code that enables Tanimoto similarity search with several recent indexing algorithms, such as Hnsw and Onng. Moreover, there is increasing interest in the computational community in developing robust benchmarking systems to assess the performance of computational algorithms. Here, we provide a benchmark to evaluate the molecular similarity search performance of these recent indexing algorithms. To avoid potential package dependency issues, two separate benchmarks are built on currently popular container technologies, Docker and Singularity. Singularity is a relatively new container framework designed specifically for high-performance computing (HPC) platforms; it requires neither privileged permissions nor a separate daemon process. Both benchmarks can be extended to incorporate new indexing algorithms, benchmarking data sets, and customized parameter settings. Our results demonstrate that graph-based methods, such as Hnsw and Onng, consistently achieve the best trade-off between search effectiveness and search efficiency. The source code of the entire benchmark system can be downloaded from https://github.uconn.edu/mldrugdiscovery/MssBenchmark.
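
To make the exhaustive baseline mentioned in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes fingerprints are binary and stored as Python integers, and uses the standard definition of Tanimoto similarity for bit sets, |A ∩ B| / |A ∪ B|. The function names `tanimoto` and `exhaustive_search` and the toy fingerprints are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption made for this sketch: two empty fingerprints count as identical.
        return 1.0
    return bin(a & b).count("1") / union


def exhaustive_search(query: int, database: List[int], k: int) -> List[Tuple[int, float]]:
    """Score every database entry against the query and return the top k, best first."""
    scored = [(i, tanimoto(query, fp)) for i, fp in enumerate(database)]
    # Sort by descending similarity, breaking ties by database index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy 8-bit fingerprints (hypothetical data, not taken from the paper).
database = [0b10110010, 0b10110011, 0b01001100, 0b11110000]
query = 0b10110010
top2 = exhaustive_search(query, database, k=2)
```

Every query in this sketch touches every database entry, so the cost grows linearly with database size. That linear scan is the cost the indexing algorithms in the benchmark aim to avoid, typically at the price of returning approximate rather than exact neighbours.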

Similar articles

Application of kernel functions for accurate similarity search in large chemical databases.

BMC Bioinformatics. 2010 Apr 29;11 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2105-11-S3-S8.

SymDex: increasing the efficiency of chemical fingerprint similarity searches for comparing large chemical libraries by using query set indexing.

J Chem Inf Model. 2012 Aug 27;52(8):1926-35. doi: 10.1021/ci200606t. Epub 2012 Aug 7.

SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters.

BMC Bioinformatics. 2004 Oct 28;5:171. doi: 10.1186/1471-2105-5-171.

GRAPES-DD: exploiting decision diagrams for index-driven search in biological graph databases.

BMC Bioinformatics. 2021 Apr 22;22(1):209. doi: 10.1186/s12859-021-04129-0.

G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.

Adv Database Technol. 2009;360:472-480. doi: 10.1145/1516360.1516416.

Dbtop: topomer similarity searching of conventional structure databases.

J Mol Graph Model. 2002 Jun;20(6):447-62. doi: 10.1016/s1093-3263(01)00146-2.

Efficient protein structure search using indexing methods.

BMC Med Inform Decis Mak. 2013;13 Suppl 1(Suppl 1):S8. doi: 10.1186/1472-6947-13-s1-s8. Epub 2013 Apr 5.

HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

J Biomed Inform. 2015 Apr;54:58-64. doi: 10.1016/j.jbi.2015.01.008. Epub 2015 Jan 24.

Bioinformatics. 2010 Apr 1;26(7):953-9. doi: 10.1093/bioinformatics/btq067. Epub 2010 Feb 23.
