A discussion of measures of enrichment in virtual screening: comparing the information content of descriptors with increasing levels of sophistication.

Author information

Bender Andreas, Glen Robert C

Affiliation

Unilever Centre for Molecular Science Informatics, Chemistry Department, University of Cambridge, Cambridge CB2 1EW, United Kingdom.

Publication information

J Chem Inf Model. 2005 Sep-Oct;45(5):1369-75. doi: 10.1021/ci0500177.

DOI:10.1021/ci0500177

PMID:16180913

Abstract

We have performed virtual screening using some very simple features, by employing the number of atoms per element as molecular descriptors but without regard to any structural information whatsoever. Surprisingly, these atom counts are able to outperform virtual-affinity-based fingerprints and Unity fingerprints in some activity classes. Although molecular weight and other biases were known in target-based virtual screening settings (docking), we report the effect of using very simple descriptors for ligand-based virtual screening, by using clearly defined biological targets and employing a large data set (>100,000 compounds) containing multiple (11) activity classes. Structure-unaware atom count vectors as descriptors in combination with the Euclidean distance measure are able to achieve "enrichment factors" over random selection of around 4 (depending on the particular class of active compounds), putting the enrichment factors reported for more sophisticated virtual screening methods in a different light. They are also able to retrieve active compounds with novel scaffolds instead of merely the expected structural analogues. The added value of many currently used virtual screening methods (calculated as enrichment factors) drops down to a factor of between 1 and 2, instead of often reported double-digit figures. The observed effect is much less profound for simple descriptors such as molecular weight and is only present in cases of atypical (larger) ligands. The current state of virtual screening is not as sophisticated as might be expected, which is due to descriptors still not being able to capture structural properties relevant to binding. This fact can partly be explained by highly nonlinear structure-activity relationships, which represent a severe limitation of the "similar property principle" in the context of bioactivity.
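
The screening idea described in the abstract (atom counts per element as descriptors, ranked by Euclidean distance to a known active, then scored by an enrichment factor) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the element list `ELEMENTS`, the helper names, the choice to parse plain molecular formulas, the toy library with its made-up activity labels, and the common definition of the enrichment factor as the hit rate in the top of the ranked list divided by the hit rate in the whole library.

```python
import math
import re
from collections import Counter

# Assumed, fixed element order for the count vector (not from the paper).
# Elements outside this tuple are ignored by count_vector.
ELEMENTS = ("C", "H", "N", "O", "S", "F", "Cl", "Br")


def atom_counts(formula):
    """Parse a simple molecular formula such as 'C9H8O4' into element counts."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def count_vector(formula, elements=ELEMENTS):
    """Structure-unaware descriptor: number of atoms of each element."""
    counts = atom_counts(formula)
    return [counts.get(element, 0) for element in elements]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_library(query_formula, library):
    """Sort (name, formula, is_active) records by distance to the query."""
    query = count_vector(query_formula)
    return sorted(library, key=lambda rec: euclidean(query, count_vector(rec[1])))


def enrichment_factor(ranked, top_n):
    """Hit rate among the top_n ranked records divided by the overall hit rate."""
    total_actives = sum(1 for rec in ranked if rec[2])
    if total_actives == 0 or top_n <= 0:
        raise ValueError("need at least one active and a positive top_n")
    top = ranked[:top_n]
    top_actives = sum(1 for rec in top if rec[2])
    return (top_actives / len(top)) / (total_actives / len(ranked))


# Toy library: formulas and activity labels are invented for illustration only.
LIBRARY = [
    ("a1", "C10H14N2O", True),
    ("a2", "C11H14N2O", True),
    ("d1", "C6H6", False),
    ("d2", "C20H30O2", False),
    ("d3", "C9H8O4", False),
    ("d4", "C2H6O", False),
    ("d5", "C12H22O11", False),
    ("d6", "C8H10N4O2", False),
]

ranked = rank_library("C10H12N2O", LIBRARY)
ef_top2 = enrichment_factor(ranked, top_n=2)
print([rec[0] for rec in ranked], ef_top2)
```

In this toy setting the two records labelled active happen to have the compositions closest to the query, so the top-2 selection is fully enriched. The value says nothing about real data; the point is only that a descriptor with no structural information can still produce an enrichment above 1, which is the baseline the abstract argues more sophisticated methods should be compared against.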
