State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, China.
J Chem Inf Model. 2014 May 27;54(5):1433-50. doi: 10.1021/ci500062f. Epub 2014 May 1.
Benchmarking data sets have become common in virtual screening in recent years, though the main focus has been on structure-based virtual screening (SBVS) approaches. Owing to the lack of crystal structures, there is a great need for unbiased benchmarking sets to evaluate various ligand-based virtual screening (LBVS) methods for important drug targets such as G protein-coupled receptors (GPCRs). To date, ready-to-apply data sets for LBVS are fairly limited, and directly using benchmarking sets designed for SBVS could bias the evaluation of LBVS. Herein, we propose an unbiased method to build benchmarking sets for LBVS and validate it on a multitude of GPCR targets. Specifically, our method can (1) ensure chemical diversity of ligands, (2) maintain the physicochemical similarity between ligands and decoys, (3) make the decoys dissimilar in chemical topology to all ligands to avoid false negatives, and (4) maximize the spatial random distribution of ligands and decoys. We evaluated the quality of our Unbiased Ligand Set (ULS) and Unbiased Decoy Set (UDS) using three common LBVS approaches, with leave-one-out (LOO) cross-validation (CV) and the average AUC of the ROC curves as the metric. Our method greatly reduced the "artificial enrichment" and "analogue bias" of a published GPCR benchmarking set, i.e., the GPCR Ligand Library (GLL)/GPCR Decoy Database (GDD). In addition, we addressed the important issue of the ratio of decoys per ligand and found that, over a range of 30 to 100, the ratio does not affect the quality of the benchmarking set, so we kept the original ratio of 39 from the GLL/GDD.
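The evaluation metric described in the abstract (average ROC AUC under leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one plausible reading of LOO-CV in which each ligand in turn serves as the query and the remaining ligands are ranked against the decoys. It also assumes Tanimoto similarity on set-based fingerprints as a stand-in for the three LBVS approaches, which the abstract does not name. The function names `tanimoto`, `roc_auc`, and `loo_mean_auc` are hypothetical.

```python
from typing import FrozenSet, Sequence

# Assumption: a fingerprint is modeled as the set of "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-index sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def roc_auc(active_scores: Sequence[float], decoy_scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, decoy) pairs in which the active
    scores higher; ties count as half (Mann-Whitney formulation)."""
    wins = 0.0
    for s_active in active_scores:
        for s_decoy in decoy_scores:
            if s_active > s_decoy:
                wins += 1.0
            elif s_active == s_decoy:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


def loo_mean_auc(ligands: Sequence[Fingerprint], decoys: Sequence[Fingerprint]) -> float:
    """Mean AUC over all ligands, each used once as the similarity query
    (assumed LOO scheme; the source does not spell out the procedure)."""
    aucs = []
    for i, query in enumerate(ligands):
        held_in = [fp for j, fp in enumerate(ligands) if j != i]
        active_scores = [tanimoto(query, fp) for fp in held_in]
        decoy_scores = [tanimoto(query, fp) for fp in decoys]
        aucs.append(roc_auc(active_scores, decoy_scores))
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    # Toy data (invented for illustration): three analogous ligands, two unrelated decoys.
    toy_ligands = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({1, 2, 5})]
    toy_decoys = [frozenset({7, 8, 9}), frozenset({10, 11})]
    print(loo_mean_auc(toy_ligands, toy_decoys))
```

Under this reading, a mean AUC far above 0.5 for a trivial similarity search on a toy set like the one above reflects the kind of "analogue bias" the abstract describes: the ligands are close analogues of one another and the decoys are trivially different, so the benchmark is easy regardless of method quality.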