School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.
J Chem Inf Model. 2010 Aug 23;50(8):1358-68. doi: 10.1021/ci100132g.
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose adding to each fingerprint a short integer signature vector of length M. For a given fingerprint, the i-th component of the signature vector counts the number of 1-bits in the fingerprint that fall on components whose index is congruent to i modulo M. Given two signatures, we show how one can rapidly compute an upper bound on the Jaccard-Tanimoto similarity of the two corresponding fingerprints, using the intersection bound. These signatures thus allow one to prune the search space significantly by discarding molecules whose bound falls below the similarity threshold. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed, together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to a speedup of one order of magnitude over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from ChemDB. Several possible algorithmic extensions are discussed.
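The signature construction and pruning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that fingerprints are stored as sets of 1-bit positions and that the intersection bound takes the form sum of min(a_i, b_i) divided by sum of max(a_i, b_i) over the two signatures, which follows because the bits shared by two fingerprints within one residue class cannot exceed the smaller of the two counts. The names `modulo_signature`, `intersection_bound`, `tanimoto`, and `search` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def modulo_signature(on_bits: Iterable[int], m: int) -> List[int]:
    """Count the 1-bits whose position is congruent to i modulo m, for each i."""
    sig = [0] * m
    for pos in on_bits:
        sig[pos % m] += 1
    return sig


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard-Tanimoto similarity of two fingerprints given as bit sets."""
    union = len(a | b)
    # Convention assumed here: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def intersection_bound(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Upper bound on the Tanimoto similarity computed from signatures only.

    Assumed form: sum(min) / sum(max) over matching signature components.
    """
    min_sum = sum(min(x, y) for x, y in zip(sig_a, sig_b))
    max_sum = sum(max(x, y) for x, y in zip(sig_a, sig_b))
    return min_sum / max_sum if max_sum else 1.0


def search(query: Set[int], database: Sequence[Set[int]], m: int,
           threshold: float) -> List[Tuple[int, float]]:
    """Return (index, similarity) for entries with similarity >= threshold."""
    # In a real system the signatures would be precomputed and stored.
    signatures = [modulo_signature(fp, m) for fp in database]
    q_sig = modulo_signature(query, m)
    hits = []
    for idx, (fp, sig) in enumerate(zip(database, signatures)):
        # Prune: if even the upper bound misses the threshold, skip the
        # more expensive exact comparison.
        if intersection_bound(q_sig, sig) < threshold:
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits
```

For instance, `search({0, 1, 2, 3}, [{0, 1, 2, 3}, {0, 4, 8, 12}, {0, 1, 2, 7}], m=4, threshold=0.5)` would skip the second entry on the signature bound alone and compute the exact similarity only for the other two. Larger values of M give tighter bounds at the cost of longer signatures, which is the trade-off the abstract's analysis of optimal M addresses.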