National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894, USA.
J Cheminform. 2011 Jul 20;3:25. doi: 10.1186/1758-2946-3-25.
PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially when it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. To that end, PubChem employs a series of pre-filters, based on the concept of volume, that remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. That molecular volume, a somewhat vague concept, is this effective raises a question: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape-similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? Or, put more specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts such as length, width, and height?
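The idea of a volume-based pre-filter can be illustrated with a minimal sketch. It assumes the usual shape Tanimoto form ST = V_AB / (V_A + V_B - V_AB) and assumes that the overlap volume V_AB can never exceed the smaller of the two volumes, which gives the upper bound ST ≤ min(V_A, V_B) / max(V_A, V_B). The function names and the example volumes are hypothetical and are not taken from PubChem.

```python
def st_upper_bound(vol_a: float, vol_b: float) -> float:
    """Upper bound on shape Tanimoto from the two volumes alone.

    Assumption: overlap volume <= min(vol_a, vol_b). Substituting that
    maximum overlap into ST = V_AB / (V_A + V_B - V_AB) gives min / max.
    """
    if vol_a <= 0 or vol_b <= 0:
        raise ValueError("volumes must be positive")
    return min(vol_a, vol_b) / max(vol_a, vol_b)


def can_skip_overlap(vol_a: float, vol_b: float, st_threshold: float = 0.8) -> bool:
    """True when the pair cannot reach the threshold, so the costly
    shape overlap optimization can be skipped with certainty."""
    return st_upper_bound(vol_a, vol_b) < st_threshold


if __name__ == "__main__":
    # Hypothetical volumes in cubic angstroms.
    print(can_skip_overlap(300.0, 500.0))  # bound is 0.6, below 0.8
    print(can_skip_overlap(450.0, 500.0))  # bound is 0.9, must be computed
```

A filter of this kind only ever discards pairs whose bound is below the threshold, so under the stated assumption it produces no false negatives.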
Using a basis set of 4.18 billion 3-D neighbor pairs identified from single conformer per compound neighboring of 17.1 million molecules, shape descriptors were computed for all conformers. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Qx, Qy, and Qz) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Qx, Qy, and Qz), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8. The efficiency of these relationships (in terms of true positives, true negatives, false positives, and false negatives) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered in the 3-D neighbor set. At an ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives, when applying the separate Qx, Qy, and Qz maps in series (Qxyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Qx filter was consistently the most efficient, followed by Qy and then by Qz. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume. Application of the monopole-based Qxyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead.
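A rough sketch of how such binned maps might be built and applied in series is given below. It is a simplification under stated assumptions, not the published implementation. Each map is keyed only on the pair of binned quadrupole values and ignores the volume dimension. The bin width, the function names, and the sample descriptor values are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Quadrupole = Tuple[float, float, float]  # (Qx, Qy, Qz) for one conformer
BinPair = Tuple[int, int]

BIN_WIDTH = 0.5  # hypothetical bin width, not taken from the paper


def bin_index(value: float, width: float = BIN_WIDTH) -> int:
    """Map a descriptor value to an integer bin."""
    return int(value // width)


def build_maps(neighbor_pairs: Iterable[Tuple[Quadrupole, Quadrupole]]) -> List[Set[BinPair]]:
    """Record, per quadrupole component, every bin pair seen among
    known 3-D neighbors (pairs already known to satisfy the ST cut-off)."""
    maps: List[Set[BinPair]] = [set(), set(), set()]
    for qa, qb in neighbor_pairs:
        for axis in range(3):
            i, j = bin_index(qa[axis]), bin_index(qb[axis])
            maps[axis].add((i, j))
            maps[axis].add((j, i))  # the relationship is symmetric
    return maps


def passes_qxyz(qa: Quadrupole, qb: Quadrupole, maps: List[Set[BinPair]]) -> bool:
    """Apply the Qx, Qy, and Qz maps in series; reject the pair as soon
    as one component falls outside the observed (allowed) bin pairs."""
    for axis in range(3):
        if (bin_index(qa[axis]), bin_index(qb[axis])) not in maps[axis]:
            return False
    return True


if __name__ == "__main__":
    known = [((10.0, 4.0, 2.0), (10.4, 4.2, 2.1))]  # made-up training pair
    maps = build_maps(known)
    print(passes_qxyz((10.1, 4.1, 2.2), (10.3, 4.3, 2.4), maps))  # similar extents
    print(passes_qxyz((10.1, 4.1, 2.2), (20.0, 4.3, 2.4), maps))  # much longer in x
```

Because the maps are derived empirically from observed neighbors rather than from a strict bound, a pair outside them can still be a true neighbor, which is consistent with the small number of false negatives reported above.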
Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape-incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation yielded an improvement of 31%, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core.
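The throughput figures can be checked with a few lines of arithmetic. The sketch below uses only the two reported rates. The function names are illustrative, and the equivalent per-pair time reduction is a derived quantity rather than a number stated in the abstract.

```python
def throughput_gain(before: float, after: float) -> float:
    """Fractional increase in pairs processed per second."""
    return after / before - 1.0


def time_reduction(before: float, after: float) -> float:
    """Fractional decrease in average time spent per pair."""
    return 1.0 - before / after


if __name__ == "__main__":
    before, after = 154_000, 202_000  # pairs per second per CPU core
    print(f"throughput gain: {throughput_gain(before, after):.1%}")
    print(f"time per pair reduced by: {time_reduction(before, after):.1%}")
```

The gain evaluates to roughly 31%, matching the stated improvement, and corresponds to spending roughly 24% less time per conformer pair.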