Kristensen Thomas G, Nielsen Jesper, Pedersen Christian N S
Bioinformatics Research Center, Aarhus University, CF Møllers Allé 8, DK-8000 Arhus C, Denmark.
Algorithms Mol Biol. 2010 Jan 4;5:9. doi: 10.1186/1748-7188-5-9.
The fingerprint of a molecule is a bitstring based on its structure, constructed such that structurally similar molecules will have similar fingerprints. Molecular fingerprints can be used in an initial phase of drug development for identifying novel drug candidates by screening large databases for molecules with fingerprints similar to a query fingerprint.
In this paper, we present a method which efficiently finds all fingerprints in a database with Tanimoto coefficient to the query fingerprint above a user defined threshold. The method is based on two novel data structures for rapid screening of large databases: the kD grid and the Multibit tree. The kD grid is based on splitting the fingerprints into k shorter bitstrings and utilising these to compute bounds on the similarity of the complete bitstrings. The Multibit tree uses hierarchical clustering and similarity within each cluster to compute similar bounds. We have implemented our method and tested it on a large real-world data set. Our experiments show that our method yields approximately a three-fold speed-up over previous methods.
Using the novel kD grid and Multibit tree significantly reduce the time needed for searching databases of fingerprints. This will allow researchers to (1) perform more searches than previously possible and (2) to easily search large databases.
分子指纹是基于其结构的一个位串,构建方式使得结构相似的分子具有相似的指纹。分子指纹可用于药物开发的初始阶段,通过在大型数据库中筛选指纹与查询指纹相似的分子来识别新型候选药物。
在本文中,我们提出了一种方法,该方法能有效地在数据库中找到所有与查询指纹的塔尼莫托系数高于用户定义阈值的指纹。该方法基于两种用于快速筛选大型数据库的新型数据结构:kd网格和多位树。kd网格基于将指纹拆分为k个较短的位串,并利用这些位串来计算完整位串相似性的边界。多位树使用层次聚类和每个聚类内的相似性来计算相似的边界。我们已经实现了我们的方法,并在一个大型真实数据集上进行了测试。我们的实验表明,我们的方法比以前的方法速度提高了约三倍。
使用新型的kd网格和多位树显著减少了搜索指纹数据库所需的时间。这将使研究人员能够(1)比以前进行更多的搜索,以及(2)轻松搜索大型数据库。