The chemfp project.

Author information

Andrew Dalke

Affiliation

Andrew Dalke Scientific AB, Trollhättan, Sweden.

Publication information

J Cheminform. 2019 Dec 5;11(1):76. doi: 10.1186/s13321-019-0398-8.

Abstract

The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.
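
The idea behind combining a popcount-sorted fingerprint set with a BitBound-style search can be illustrated with a minimal Python sketch. It is not chemfp's implementation or API: fingerprints are stored as plain Python integers, and the names `build_index`, `candidate_range`, and `threshold_search` are hypothetical. It relies on the standard popcount bound: for a query with A bits set and a Tanimoto threshold T > 0, a target with B bits set can only reach the threshold if A·T ≤ B ≤ A/T, so a sorted index lets the search skip everything outside that popcount range.

```python
from bisect import bisect_left, bisect_right


def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity |a AND b| / |a OR b| (0.0 when both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def build_index(fingerprints):
    """Sort fingerprints by popcount; return (popcounts, fingerprints)."""
    ordered = sorted(fingerprints, key=popcount)
    return [popcount(fp) for fp in ordered], ordered


def candidate_range(counts, query, threshold):
    """Index slice whose popcounts could reach the threshold (threshold > 0).

    The small epsilon only widens the range, so floating-point rounding
    can admit an extra candidate but never drop a valid one.
    """
    a = popcount(query)
    eps = 1e-9
    lo = bisect_left(counts, a * threshold - eps)
    hi = bisect_right(counts, a / threshold + eps)
    return lo, hi


def threshold_search(index, query, threshold):
    """Return (score, fingerprint) hits with score >= threshold, best first."""
    counts, fps = index
    lo, hi = candidate_range(counts, query, threshold)
    scored = ((tanimoto(query, fp), fp) for fp in fps[lo:hi])
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


# Toy data: six 8-bit fingerprints (assumed values, for illustration only).
targets = [0b11110000, 0b11100000, 0b11111100,
           0b00001111, 0b11111111, 0b10000000]
index = build_index(targets)
hits = threshold_search(index, 0b11110000, 0.7)
```

In this toy case the popcount bound narrows six targets to three candidates before any Tanimoto is computed. Because the candidates are contiguous in the sorted array, the scan also touches memory sequentially, which is in the spirit of the memory-coherency point made in the abstract.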

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2652/6896769/5b1d380d6067/13321_2019_398_Fig1_HTML.jpg
